Auto Insurance Claims Fraud Detection¶

Business Requirement¶

An insurance company has approached you with a dataset of previous claims of their clients. The insurance company wants you to develop a model to help them predict which claims look fraudulent. By doing so you hope to save the company millions of dollars annually.

Claim related fraud is a huge problem in the insurance industry. It is quite complex and difficult to identify those unwanted claims. With Random Forest Non-Parametric Machine Learning Algorithm, I am trying to troubleshoot and help the General Insurance industry with this problem.

The data that I have is from Automobile Insurance. I will be creating a predictive model that predicts if an insurance claim is fraudulent or not. The answere between YES/NO, is a Binary Classification task. A comparison study has been performed to understand which ML algorithm suits best to the dataset.

In [1]:
import os
os.getcwd()
Out[1]:
'C:\\Users\\Anoop Mishra'
In [2]:
#Importing required libraries

import pandas as pd
from matplotlib import pyplot as plt
import seaborn as sns
from sklearn.ensemble import ExtraTreesRegressor
from sklearn.preprocessing import LabelEncoder, StandardScaler
from sklearn.model_selection import train_test_split
import numpy as np
import sklearn.metrics
from pylab import rcParams
%matplotlib inline
pd.set_option('display.max_columns', 500)
pd.set_option('display.max_rows', 500)

# pandas version 0.24 or upper is required
pd.__version__
Out[2]:
'1.1.3'
In [3]:
#load & view raw data
df = pd.read_csv('F:/insurance_claims.csv')
df.head(10)
Out[3]:
months_as_customer age policy_number policy_bind_date policy_state policy_csl policy_deductable policy_annual_premium umbrella_limit insured_zip insured_sex insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss incident_date incident_type collision_type incident_severity authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved property_damage bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year fraud_reported _c39
0 328 48 521585 17-10-2014 OH 250/500 1000 1406.91 0 466132 MALE MD craft-repair sleeping husband 53300 0 25-01-2015 Single Vehicle Collision Side Collision Major Damage Police SC Columbus 9935 4th Drive 5 1 YES 1 2 YES 71610 6510 13020 52080 Saab 92x 2004 Y NaN
1 228 42 342868 27-06-2006 IN 250/500 2000 1197.22 5000000 468176 MALE MD machine-op-inspct reading other-relative 0 0 21-01-2015 Vehicle Theft ? Minor Damage Police VA Riverwood 6608 MLK Hwy 8 1 ? 0 0 ? 5070 780 780 3510 Mercedes E400 2007 Y NaN
2 134 29 687698 06-09-2000 OH 100/300 2000 1413.14 5000000 430632 FEMALE PhD sales board-games own-child 35100 0 22-02-2015 Multi-vehicle Collision Rear Collision Minor Damage Police NY Columbus 7121 Francis Lane 7 3 NO 2 3 NO 34650 7700 3850 23100 Dodge RAM 2007 N NaN
3 256 41 227811 25-05-1990 IL 250/500 2000 1415.74 6000000 608117 FEMALE PhD armed-forces board-games unmarried 48900 -62400 10-01-2015 Single Vehicle Collision Front Collision Major Damage Police OH Arlington 6956 Maple Drive 5 1 ? 1 2 NO 63400 6340 6340 50720 Chevrolet Tahoe 2014 Y NaN
4 228 44 367455 06-06-2014 IL 500/1000 1000 1583.91 6000000 610706 MALE Associate sales board-games unmarried 66000 -46000 17-02-2015 Vehicle Theft ? Minor Damage None NY Arlington 3041 3rd Ave 20 1 NO 0 1 NO 6500 1300 650 4550 Accura RSX 2009 N NaN
5 256 39 104594 12-10-2006 OH 250/500 1000 1351.10 0 478456 FEMALE PhD tech-support bungie-jumping unmarried 0 0 02-01-2015 Multi-vehicle Collision Rear Collision Major Damage Fire SC Arlington 8973 Washington St 19 3 NO 0 2 NO 64100 6410 6410 51280 Saab 95 2003 Y NaN
6 137 34 413978 04-06-2000 IN 250/500 1000 1333.35 0 441716 MALE PhD prof-specialty board-games husband 0 -77000 13-01-2015 Multi-vehicle Collision Front Collision Minor Damage Police NY Springfield 5846 Weaver Drive 0 3 ? 0 0 ? 78650 21450 7150 50050 Nissan Pathfinder 2012 N NaN
7 165 37 429027 03-02-1990 IL 100/300 1000 1137.03 0 603195 MALE Associate tech-support base-jumping unmarried 0 0 27-02-2015 Multi-vehicle Collision Front Collision Total Loss Police VA Columbus 3525 3rd Hwy 23 3 ? 2 2 YES 51590 9380 9380 32830 Audi A5 2015 N NaN
8 27 33 485665 05-02-1997 IL 100/300 500 1442.99 0 601734 FEMALE PhD other-service golf own-child 0 0 30-01-2015 Single Vehicle Collision Front Collision Total Loss Police WV Arlington 4872 Rock Ridge 21 1 NO 1 1 YES 27700 2770 2770 22160 Toyota Camry 2012 N NaN
9 212 42 636550 25-07-2011 IL 100/300 500 1315.68 0 600983 MALE PhD priv-house-serv camping wife 0 -39300 05-01-2015 Single Vehicle Collision Rear Collision Total Loss Other NC Hillsdale 3066 Francis Ave 14 1 NO 2 1 ? 42300 4700 4700 32900 Saab 92x 1996 N NaN
In [4]:
df.describe()
Out[4]:
months_as_customer age policy_number policy_deductable policy_annual_premium umbrella_limit insured_zip capital-gains capital-loss incident_hour_of_the_day number_of_vehicles_involved bodily_injuries witnesses total_claim_amount injury_claim property_claim vehicle_claim auto_year _c39
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1.000000e+03 1000.000000 1000.000000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 1000.000000 1000.000000 0.0
mean 203.954000 38.948000 546238.648000 1136.000000 1256.406150 1.101000e+06 501214.488000 25126.100000 -26793.700000 11.644000 1.83900 0.992000 1.487000 52761.94000 7433.420000 7399.570000 37928.950000 2005.103000 NaN
std 115.113174 9.140287 257063.005276 611.864673 244.167395 2.297407e+06 71701.610941 27872.187708 28104.096686 6.951373 1.01888 0.820127 1.111335 26401.53319 4880.951853 4824.726179 18886.252893 6.015861 NaN
min 0.000000 19.000000 100804.000000 500.000000 433.330000 -1.000000e+06 430104.000000 0.000000 -111100.000000 0.000000 1.00000 0.000000 0.000000 100.00000 0.000000 0.000000 70.000000 1995.000000 NaN
25% 115.750000 32.000000 335980.250000 500.000000 1089.607500 0.000000e+00 448404.500000 0.000000 -51500.000000 6.000000 1.00000 0.000000 1.000000 41812.50000 4295.000000 4445.000000 30292.500000 2000.000000 NaN
50% 199.500000 38.000000 533135.000000 1000.000000 1257.200000 0.000000e+00 466445.500000 0.000000 -23250.000000 12.000000 1.00000 1.000000 1.000000 58055.00000 6775.000000 6750.000000 42100.000000 2005.000000 NaN
75% 276.250000 44.000000 759099.750000 2000.000000 1415.695000 0.000000e+00 603251.000000 51025.000000 0.000000 17.000000 3.00000 2.000000 2.000000 70592.50000 11305.000000 10885.000000 50822.500000 2010.000000 NaN
max 479.000000 64.000000 999435.000000 2000.000000 2047.590000 1.000000e+07 620962.000000 100500.000000 0.000000 23.000000 4.00000 2.000000 3.000000 114920.00000 21450.000000 23670.000000 79560.000000 2015.000000 NaN
In [5]:
df.dtypes
Out[5]:
months_as_customer               int64
age                              int64
policy_number                    int64
policy_bind_date                object
policy_state                    object
policy_csl                      object
policy_deductable                int64
policy_annual_premium          float64
umbrella_limit                   int64
insured_zip                      int64
insured_sex                     object
insured_education_level         object
insured_occupation              object
insured_hobbies                 object
insured_relationship            object
capital-gains                    int64
capital-loss                     int64
incident_date                   object
incident_type                   object
collision_type                  object
incident_severity               object
authorities_contacted           object
incident_state                  object
incident_city                   object
incident_location               object
incident_hour_of_the_day         int64
number_of_vehicles_involved      int64
property_damage                 object
bodily_injuries                  int64
witnesses                        int64
police_report_available         object
total_claim_amount               int64
injury_claim                     int64
property_claim                   int64
vehicle_claim                    int64
auto_make                       object
auto_model                      object
auto_year                        int64
fraud_reported                  object
_c39                           float64
dtype: object
In [6]:
df.columns
Out[6]:
Index(['months_as_customer', 'age', 'policy_number', 'policy_bind_date',
       'policy_state', 'policy_csl', 'policy_deductable',
       'policy_annual_premium', 'umbrella_limit', 'insured_zip', 'insured_sex',
       'insured_education_level', 'insured_occupation', 'insured_hobbies',
       'insured_relationship', 'capital-gains', 'capital-loss',
       'incident_date', 'incident_type', 'collision_type', 'incident_severity',
       'authorities_contacted', 'incident_state', 'incident_city',
       'incident_location', 'incident_hour_of_the_day',
       'number_of_vehicles_involved', 'property_damage', 'bodily_injuries',
       'witnesses', 'police_report_available', 'total_claim_amount',
       'injury_claim', 'property_claim', 'vehicle_claim', 'auto_make',
       'auto_model', 'auto_year', 'fraud_reported', '_c39'],
      dtype='object')
In [7]:
df.shape
Out[7]:
(1000, 40)
In [8]:
df.nunique()
Out[8]:
months_as_customer              391
age                              46
policy_number                  1000
policy_bind_date                951
policy_state                      3
policy_csl                        3
policy_deductable                 3
policy_annual_premium           991
umbrella_limit                   11
insured_zip                     995
insured_sex                       2
insured_education_level           7
insured_occupation               14
insured_hobbies                  20
insured_relationship              6
capital-gains                   338
capital-loss                    354
incident_date                    60
incident_type                     4
collision_type                    4
incident_severity                 4
authorities_contacted             5
incident_state                    7
incident_city                     7
incident_location              1000
incident_hour_of_the_day         24
number_of_vehicles_involved       4
property_damage                   3
bodily_injuries                   3
witnesses                         4
police_report_available           3
total_claim_amount              763
injury_claim                    638
property_claim                  626
vehicle_claim                   726
auto_make                        14
auto_model                       39
auto_year                        21
fraud_reported                    2
_c39                              0
dtype: int64
In [9]:
plt.style.use('fivethirtyeight')
#ax = sns.distplot(df.age, bins=np.arange(19,64,5))
ax = sns.displot(df.age, bins=np.arange(19,64,5),kde=True)
#ax.set_ylabel('Density')
#ax.set_xlabel('Age')
plt.show()
In [10]:
np.seterr(invalid='ignore') # To remove "RuntimeWarning: invalid value encountered in minimum"
plt.style.use('fivethirtyeight')
ax = sns.countplot(x='fraud_reported', data=df, hue='fraud_reported')
ax.set_xlabel('Fraud Reported')
ax.set_ylabel('Fraud Count')
plt.show();

From above plot, like most fraud datasets, the label distribution is skewed.

In [11]:
df['fraud_reported'].value_counts() # Count number of frauds vs non-frauds
Out[11]:
N    753
Y    247
Name: fraud_reported, dtype: int64
In [12]:
df['incident_state'].value_counts()
Out[12]:
NY    262
SC    248
WV    217
VA    110
NC    110
PA     30
OH     23
Name: incident_state, dtype: int64

Here we see that almost 25% fraud reported. Let’s try to look for an indicative variable. Let’s analyze location. This dataset only has information from the mid-Atlantic states from the USA.

In [13]:
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(10,6))
ax = df.groupby('incident_state').fraud_reported.count().plot.bar(ylim=0)
ax.set_ylabel('Fraud Reported')
ax.set_xlabel('Incident State')
plt.show()
In [14]:
df['incident_state_count'] = df['incident_state']
for i in range(len(df['incident_state_count'])):
    if df.iloc[i, 40] == "NY":
        df.iloc[i, 40] = 262
    if df.iloc[i, 40] == "SC":
        df.iloc[i, 40] = 248
    if df.iloc[i, 40] == "WV":
        df.iloc[i, 40] = 217
    if df.iloc[i, 40] == "VA":
        df.iloc[i, 40] = 110
    if df.iloc[i, 40] == "NC":
        df.iloc[i, 40] = 110
    if df.iloc[i, 40] == "PA":
        df.iloc[i, 40] = 30
    if df.iloc[i, 40] == "OH":
        df.iloc[i, 40] = 23
        
from plotly.offline import plot
import plotly.graph_objs as go

data = [go.Choropleth(autocolorscale = True, locations = df['incident_state'],
                      z = df['incident_state_count'],
                      locationmode = 'USA-states',
                      marker = go.choropleth.Marker(line = go.choropleth.marker.Line(color = 'rgb(255,255,255)', width = 2)),
                      colorbar = go.choropleth.ColorBar(title = "Number of Incidents"))]
layout = go.Layout(
    title = go.layout.Title(
        text = 'Insurance Incident Claims on the Mid-Atlantic'
    ),
    geo = go.layout.Geo(
        scope = 'usa',
        projection = go.layout.geo.Projection(type = 'albers usa'),
        showlakes = True,
        lakecolor = 'rgb(255, 255, 255)'),
)
fig = go.Figure(data = data, layout = layout)

#plot(fig, filename = 'd3-cloropleth-map')  # for showing in seprate tab
fig.show()
In [15]:
plt.rcParams['figure.figsize'] = [15, 8]
ax= plt.style.use('fivethirtyeight')
table=pd.crosstab(df.age, df.fraud_reported)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Age vs Fraud Reported', fontsize=12)
plt.xlabel('Age')
plt.ylabel('Fraud Reported')
plt.show()

From above plot, it is obvious that, age is an important predictor for fraud reported. Age between 19-23 shows substantial number od fraud report.

In [16]:
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(18,8))
ax = df.groupby('incident_date').total_claim_amount.count().plot.bar(ylim=0)
ax.set_ylabel('Claim amount ($)')
ax.set_xlabel('Incident Date')
plt.show()

We see that, all the cases in above plot are for the months of January and February 2015

In [17]:
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(10,6))
ax = df.groupby('policy_state').fraud_reported.count().plot.bar(ylim=0)
ax.set_ylabel('Fraud Reported')
ax.set_xlabel('Policy State')
plt.show()
In [18]:
plt.rcParams['figure.figsize'] = [10, 6]
ax= plt.style.use('fivethirtyeight')
table=pd.crosstab(df.policy_state, df.fraud_reported)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Policy State vs Fraud Reported', fontsize=12)
plt.xlabel('Policy State')
plt.ylabel('Fraud Reported')
plt.show()
In [19]:
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(10,6))
ax = df.groupby('incident_type').fraud_reported.count().plot.bar(ylim=0)
ax.set_xticklabels(ax.get_xticklabels(), rotation=20, ha="right")
ax.set_ylabel('Fraud Reported')
ax.set_xlabel('Incident Type')
plt.show()
In [20]:
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(10,6))
ax = sns.countplot(x='incident_state', data=df)
ax.set_ylabel('Fraud Reported')
ax.set_xlabel('Incident State')
Out[20]:
Text(0.5, 0, 'Incident State')
In [21]:
fig = plt.figure(figsize=(10,6))
ax = sns.countplot(y = 'insured_education_level', data=df) 
ax.set_ylabel('Policy Annual Premium')
ax.set_xlabel('Insured Education Level')
plt.show()

# # Breakdown of Average Vehicle claim by insured's education level, grouped by fraud reported
In [22]:
fig = plt.figure(figsize=(16,10))
ax = sns.catplot(x='fraud_reported', y='policy_annual_premium',hue='insured_education_level', data=df,
                    kind="bar", ci=None, palette="muted",height=6, legend=True, aspect=1.2) 

ax.set_axis_labels("Fraud Reported", "Policy Annual Premium")

plt.show()
<Figure size 1152x720 with 0 Axes>
In [23]:
plt.rcParams['figure.figsize'] = [14, 6]
table=pd.crosstab(df.insured_education_level, df.fraud_reported)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of insured education vs Fraud reported', fontsize=12)
plt.xlabel('Insured Education Level')
plt.ylabel('Fraud Reported');
In [24]:
plt.rcParams['figure.figsize'] = [6, 6]
ax = (df['insured_sex'].value_counts()*100.0 /len(df))\
.plot.pie(autopct='%.1f%%', labels = ['Male', 'Female'], fontsize=12)                                                                           
ax.set_title('% Gender')
plt.ylabel('Insured Sex')
plt.show()
In [25]:
plt.rcParams['figure.figsize'] = [11, 6]
table=pd.crosstab(df.insured_sex, df.fraud_reported)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of insured_sex vs Fraud', fontsize=12)
plt.xlabel('Insured Sex')
plt.ylabel('Fraud Reported')
plt.show()
In [26]:
plt.rcParams['figure.figsize'] = [8, 8]
ax = (df['insured_relationship'].value_counts()*100.0 /len(df))\
.plot.pie(autopct='%.1f%%', labels = ['husband', 'wife', 'own-child', 'unmarried', 'other-relative', 'not-in-family'],
         fontsize=12)                                                                           
ax.set_title('% Relationship')
plt.ylabel('Insured Relationship')
plt.show()
In [27]:
table=pd.crosstab(df.insured_relationship, df.fraud_reported)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of insured_relationship vs Fraud', fontsize=12)
plt.xlabel('Insured Relationship')
plt.ylabel('Fraud Reported')
plt.show()
In [28]:
fig = plt.figure(figsize=(6,6))
ax = (df['incident_type'].value_counts()*100.0 /len(df))\
.plot.pie(autopct='%.1f%%', labels = ['Parked Car', 'Single Vehile Collision', 'Multi-vehicle Collision', 'Vehicle Theft'],
         fontsize=12);
plt.ylabel('Incident Type')
Out[28]:
Text(0, 0.5, 'Incident Type')
In [29]:
fig = plt.figure(figsize=(6,6))
ax = (df['authorities_contacted'].value_counts()*100.0 /len(df))\
.plot.pie(autopct='%.1f%%', labels = ['Police', 'Fire', 'Other', 'None', 'Ambulance'],
         fontsize=12)
plt.ylabel('Authorities Contacted')
Out[29]:
Text(0, 0.5, 'Authorities Contacted')
In [30]:
fig = plt.figure(figsize=(12,6))
ax = sns.countplot(x='auto_make', data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.xlabel('Auto Make')
plt.ylabel('Auto Count')
plt.show()
In [31]:
fig = plt.figure(figsize=(6,6))
ax = (df['incident_severity'].value_counts()*100.0 /len(df))\
.plot.pie(autopct='%.1f%%', labels = ['Major Damage', 'Total Loss', 'Minor Damage', 'Trivial Damage'],
         fontsize=12)
plt.ylabel('Incident Severity');
In [32]:
fig = plt.figure(figsize=(10,6))
ax = sns.countplot(x='insured_hobbies', data=df)
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.xlabel('Insured Hobbies')
plt.ylabel('Count of Insured')
plt.show()
In [33]:
df["insured_occupation"].value_counts()
Out[33]:
machine-op-inspct    93
prof-specialty       85
tech-support         78
sales                76
exec-managerial      76
craft-repair         74
transport-moving     72
other-service        71
priv-house-serv      71
armed-forces         69
adm-clerical         65
protective-serv      63
handlers-cleaners    54
farming-fishing      53
Name: insured_occupation, dtype: int64
In [34]:
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(10,6))
ax= df.groupby('auto_make').vehicle_claim.count().plot.bar(ylim=0)
ax.set_ylabel('Vehicle Claim')
ax.set_xlabel('Auto Make')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()
In [35]:
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(10,6))
ax= df.groupby('insured_hobbies').total_claim_amount.count().plot.bar(ylim=0)
ax.set_ylabel('Total Claim Amount')
ax.set_xlabel('Insured Hobbies')
ax.set_xticklabels(ax.get_xticklabels(), rotation=40, ha="right")
plt.show()

Data Processing¶

Cleaning up the data and prepare it for machine learning model.

In [36]:
df['fraud_reported'].replace(to_replace='Y', value=1, inplace=True)
df['fraud_reported'].replace(to_replace='N',  value=0, inplace=True)

df.head()
Out[36]:
months_as_customer age policy_number policy_bind_date policy_state policy_csl policy_deductable policy_annual_premium umbrella_limit insured_zip insured_sex insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss incident_date incident_type collision_type incident_severity authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved property_damage bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year fraud_reported _c39 incident_state_count
0 328 48 521585 17-10-2014 OH 250/500 1000 1406.91 0 466132 MALE MD craft-repair sleeping husband 53300 0 25-01-2015 Single Vehicle Collision Side Collision Major Damage Police SC Columbus 9935 4th Drive 5 1 YES 1 2 YES 71610 6510 13020 52080 Saab 92x 2004 1 NaN 248
1 228 42 342868 27-06-2006 IN 250/500 2000 1197.22 5000000 468176 MALE MD machine-op-inspct reading other-relative 0 0 21-01-2015 Vehicle Theft ? Minor Damage Police VA Riverwood 6608 MLK Hwy 8 1 ? 0 0 ? 5070 780 780 3510 Mercedes E400 2007 1 NaN 110
2 134 29 687698 06-09-2000 OH 100/300 2000 1413.14 5000000 430632 FEMALE PhD sales board-games own-child 35100 0 22-02-2015 Multi-vehicle Collision Rear Collision Minor Damage Police NY Columbus 7121 Francis Lane 7 3 NO 2 3 NO 34650 7700 3850 23100 Dodge RAM 2007 0 NaN 262
3 256 41 227811 25-05-1990 IL 250/500 2000 1415.74 6000000 608117 FEMALE PhD armed-forces board-games unmarried 48900 -62400 10-01-2015 Single Vehicle Collision Front Collision Major Damage Police OH Arlington 6956 Maple Drive 5 1 ? 1 2 NO 63400 6340 6340 50720 Chevrolet Tahoe 2014 1 NaN 23
4 228 44 367455 06-06-2014 IL 500/1000 1000 1583.91 6000000 610706 MALE Associate sales board-games unmarried 66000 -46000 17-02-2015 Vehicle Theft ? Minor Damage None NY Arlington 3041 3rd Ave 20 1 NO 0 1 NO 6500 1300 650 4550 Accura RSX 2009 0 NaN 262
In [37]:
df[['insured_zip']] = df[['insured_zip']].astype(object)
df.describe()
Out[37]:
months_as_customer age policy_number policy_deductable policy_annual_premium umbrella_limit capital-gains capital-loss incident_hour_of_the_day number_of_vehicles_involved bodily_injuries witnesses total_claim_amount injury_claim property_claim vehicle_claim auto_year fraud_reported _c39
count 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 1.000000e+03 1000.000000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 1000.00000 1000.000000 1000.000000 1000.000000 1000.000000 1000.000000 0.0
mean 203.954000 38.948000 546238.648000 1136.000000 1256.406150 1.101000e+06 25126.100000 -26793.700000 11.644000 1.83900 0.992000 1.487000 52761.94000 7433.420000 7399.570000 37928.950000 2005.103000 0.247000 NaN
std 115.113174 9.140287 257063.005276 611.864673 244.167395 2.297407e+06 27872.187708 28104.096686 6.951373 1.01888 0.820127 1.111335 26401.53319 4880.951853 4824.726179 18886.252893 6.015861 0.431483 NaN
min 0.000000 19.000000 100804.000000 500.000000 433.330000 -1.000000e+06 0.000000 -111100.000000 0.000000 1.00000 0.000000 0.000000 100.00000 0.000000 0.000000 70.000000 1995.000000 0.000000 NaN
25% 115.750000 32.000000 335980.250000 500.000000 1089.607500 0.000000e+00 0.000000 -51500.000000 6.000000 1.00000 0.000000 1.000000 41812.50000 4295.000000 4445.000000 30292.500000 2000.000000 0.000000 NaN
50% 199.500000 38.000000 533135.000000 1000.000000 1257.200000 0.000000e+00 0.000000 -23250.000000 12.000000 1.00000 1.000000 1.000000 58055.00000 6775.000000 6750.000000 42100.000000 2005.000000 0.000000 NaN
75% 276.250000 44.000000 759099.750000 2000.000000 1415.695000 0.000000e+00 51025.000000 0.000000 17.000000 3.00000 2.000000 2.000000 70592.50000 11305.000000 10885.000000 50822.500000 2010.000000 0.000000 NaN
max 479.000000 64.000000 999435.000000 2000.000000 2047.590000 1.000000e+07 100500.000000 0.000000 23.000000 4.00000 2.000000 3.000000 114920.00000 21450.000000 23670.000000 79560.000000 2015.000000 1.000000 NaN

Some variables such as 'policy_bind_date', 'incident_date', 'incident_location' and 'insured_zip' contain very high number of level. We will remove these columns for our purposes.

Let's view summary of all the column with the object data-type :

In [38]:
df.describe(include='all')
Out[38]:
months_as_customer age policy_number policy_bind_date policy_state policy_csl policy_deductable policy_annual_premium umbrella_limit insured_zip insured_sex insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss incident_date incident_type collision_type incident_severity authorities_contacted incident_state incident_city incident_location incident_hour_of_the_day number_of_vehicles_involved property_damage bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model auto_year fraud_reported _c39 incident_state_count
count 1000.000000 1000.000000 1000.000000 1000 1000 1000 1000.000000 1000.000000 1.000000e+03 1000.0 1000 1000 1000 1000 1000 1000.000000 1000.000000 1000 1000 1000 1000 1000 1000 1000 1000 1000.000000 1000.00000 1000 1000.000000 1000.000000 1000 1000.00000 1000.000000 1000.000000 1000.000000 1000 1000 1000.000000 1000.000000 0.0 1000.0
unique NaN NaN NaN 951 3 3 NaN NaN NaN 995.0 2 7 14 20 6 NaN NaN 60 4 4 4 5 7 7 1000 NaN NaN 3 NaN NaN 3 NaN NaN NaN NaN 14 39 NaN NaN NaN 6.0
top NaN NaN NaN 01-01-2006 OH 250/500 NaN NaN NaN 431202.0 FEMALE JD machine-op-inspct reading own-child NaN NaN 02-02-2015 Multi-vehicle Collision Rear Collision Minor Damage Police NY Springfield 6435 Texas Ave NaN NaN ? NaN NaN ? NaN NaN NaN NaN Suburu RAM NaN NaN NaN 262.0
freq NaN NaN NaN 3 352 351 NaN NaN NaN 2.0 537 161 93 64 183 NaN NaN 28 419 292 354 292 262 157 1 NaN NaN 360 NaN NaN 343 NaN NaN NaN NaN 80 43 NaN NaN NaN 262.0
mean 203.954000 38.948000 546238.648000 NaN NaN NaN 1136.000000 1256.406150 1.101000e+06 NaN NaN NaN NaN NaN NaN 25126.100000 -26793.700000 NaN NaN NaN NaN NaN NaN NaN NaN 11.644000 1.83900 NaN 0.992000 1.487000 NaN 52761.94000 7433.420000 7399.570000 37928.950000 NaN NaN 2005.103000 0.247000 NaN NaN
std 115.113174 9.140287 257063.005276 NaN NaN NaN 611.864673 244.167395 2.297407e+06 NaN NaN NaN NaN NaN NaN 27872.187708 28104.096686 NaN NaN NaN NaN NaN NaN NaN NaN 6.951373 1.01888 NaN 0.820127 1.111335 NaN 26401.53319 4880.951853 4824.726179 18886.252893 NaN NaN 6.015861 0.431483 NaN NaN
min 0.000000 19.000000 100804.000000 NaN NaN NaN 500.000000 433.330000 -1.000000e+06 NaN NaN NaN NaN NaN NaN 0.000000 -111100.000000 NaN NaN NaN NaN NaN NaN NaN NaN 0.000000 1.00000 NaN 0.000000 0.000000 NaN 100.00000 0.000000 0.000000 70.000000 NaN NaN 1995.000000 0.000000 NaN NaN
25% 115.750000 32.000000 335980.250000 NaN NaN NaN 500.000000 1089.607500 0.000000e+00 NaN NaN NaN NaN NaN NaN 0.000000 -51500.000000 NaN NaN NaN NaN NaN NaN NaN NaN 6.000000 1.00000 NaN 0.000000 1.000000 NaN 41812.50000 4295.000000 4445.000000 30292.500000 NaN NaN 2000.000000 0.000000 NaN NaN
50% 199.500000 38.000000 533135.000000 NaN NaN NaN 1000.000000 1257.200000 0.000000e+00 NaN NaN NaN NaN NaN NaN 0.000000 -23250.000000 NaN NaN NaN NaN NaN NaN NaN NaN 12.000000 1.00000 NaN 1.000000 1.000000 NaN 58055.00000 6775.000000 6750.000000 42100.000000 NaN NaN 2005.000000 0.000000 NaN NaN
75% 276.250000 44.000000 759099.750000 NaN NaN NaN 2000.000000 1415.695000 0.000000e+00 NaN NaN NaN NaN NaN NaN 51025.000000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN 17.000000 3.00000 NaN 2.000000 2.000000 NaN 70592.50000 11305.000000 10885.000000 50822.500000 NaN NaN 2010.000000 0.000000 NaN NaN
max 479.000000 64.000000 999435.000000 NaN NaN NaN 2000.000000 2047.590000 1.000000e+07 NaN NaN NaN NaN NaN NaN 100500.000000 0.000000 NaN NaN NaN NaN NaN NaN NaN NaN 23.000000 4.00000 NaN 2.000000 3.000000 NaN 114920.00000 21450.000000 23670.000000 79560.000000 NaN NaN 2015.000000 1.000000 NaN NaN

Some values in the table are shown here as “NaN”. We will see how to deal with these missing values.

In [39]:
plt.style.use('fivethirtyeight')
plt.figure(figsize=(14,6))
table=pd.crosstab(df.policy_csl, df.fraud_reported)
table.div(table.sum(1).astype(float), axis=0).plot(kind='bar', stacked=True)
plt.title('Stacked Bar Chart of Policy Csl vs Fraud', fontsize=12)
plt.xlabel('Policy Csl')
plt.ylabel('Fraud Reported')
plt.show();
<Figure size 1008x432 with 0 Axes>

policy_csl looks like an unavidable predictor.

In [40]:
df['csl_per_person'] = df.policy_csl.str.split('/', expand=True)[0]
df['csl_per_accident'] = df.policy_csl.str.split('/', expand=True)[1]
df['csl_per_person'].head()
Out[40]:
0    250
1    250
2    100
3    250
4    500
Name: csl_per_person, dtype: object
In [41]:
df['csl_per_accident'].head()
Out[41]:
0     500
1     500
2     300
3     500
4    1000
Name: csl_per_accident, dtype: object
In [42]:
df.auto_year.value_counts()  # check the spread of years to decide on further action.
Out[42]:
1995    56
1999    55
2005    54
2011    53
2006    53
2007    52
2003    51
2010    50
2009    50
2013    49
2002    49
2015    47
1997    46
2012    46
2008    45
2014    44
2001    42
2000    42
1998    40
2004    39
1996    37
Name: auto_year, dtype: int64

auto_year has 21 levels, and the number of records for each of the levels are quite significant considering datasize is not so large. We will do some feature engineering using this variable considering, the year of manufacturing of automobile indicates the age of the vehicle and may contain valuable information for insurance premium or fraud is concerned.

In [43]:
df['vehicle_age'] = 2018 - df['auto_year'] # Deriving the age of the vehicle based on the year value 
df['vehicle_age'].head(10)
Out[43]:
0    14
1    11
2    11
3     4
4     9
5    15
6     6
7     3
8     6
9    22
Name: vehicle_age, dtype: int64
In [44]:
bins = [-1, 3, 6, 9, 12, 17, 20, 24]  # Factorize according to the time period of the day.
names = ["past_midnight", "early_morning", "morning", 'fore-noon', 'afternoon', 'evening', 'night']
df['incident_period_of_day'] = pd.cut(df.incident_hour_of_the_day, bins, labels=names).astype(object)
df[['incident_hour_of_the_day', 'incident_period_of_day']].head(20)
Out[44]:
incident_hour_of_the_day incident_period_of_day
0 5 early_morning
1 8 morning
2 7 morning
3 5 early_morning
4 20 evening
5 19 evening
6 0 past_midnight
7 23 night
8 21 night
9 14 afternoon
10 22 night
11 21 night
12 9 morning
13 5 early_morning
14 12 fore-noon
15 12 fore-noon
16 0 past_midnight
17 9 morning
18 19 evening
19 8 morning
In [45]:
# Check on categorical variables:
df.select_dtypes(include=['object']).columns  # checking categorcial columns
Out[45]:
Index(['policy_bind_date', 'policy_state', 'policy_csl', 'insured_zip',
       'insured_sex', 'insured_education_level', 'insured_occupation',
       'insured_hobbies', 'insured_relationship', 'incident_date',
       'incident_type', 'collision_type', 'incident_severity',
       'authorities_contacted', 'incident_state', 'incident_city',
       'incident_location', 'property_damage', 'police_report_available',
       'auto_make', 'auto_model', 'incident_state_count', 'csl_per_person',
       'csl_per_accident', 'incident_period_of_day'],
      dtype='object')
In [46]:
# dropping unimportant columns

df = df.drop(columns = [
    'policy_number', 
    'policy_csl',
    'insured_zip',
    'policy_bind_date', 
    'incident_date', 
    'incident_location', 
    '_c39', 
    'auto_year', 
    'incident_hour_of_the_day'])

df.head(2)
Out[46]:
months_as_customer age policy_state policy_deductable policy_annual_premium umbrella_limit insured_sex insured_education_level insured_occupation insured_hobbies insured_relationship capital-gains capital-loss incident_type collision_type incident_severity authorities_contacted incident_state incident_city number_of_vehicles_involved property_damage bodily_injuries witnesses police_report_available total_claim_amount injury_claim property_claim vehicle_claim auto_make auto_model fraud_reported incident_state_count csl_per_person csl_per_accident vehicle_age incident_period_of_day
0 328 48 OH 1000 1406.91 0 MALE MD craft-repair sleeping husband 53300 0 Single Vehicle Collision Side Collision Major Damage Police SC Columbus 1 YES 1 2 YES 71610 6510 13020 52080 Saab 92x 1 248 250 500 14 early_morning
1 228 42 IN 2000 1197.22 5000000 MALE MD machine-op-inspct reading other-relative 0 0 Vehicle Theft ? Minor Damage Police VA Riverwood 1 ? 0 0 ? 5070 780 780 3510 Mercedes E400 1 110 250 500 11 morning
In [47]:
# identify variables with '?' values
unknowns = {}
for i in list(df.columns):
    if (df[i]).dtype == object:
        j = np.sum(df[i] == "?")
        unknowns[i] = j
unknowns = pd.DataFrame.from_dict(unknowns, orient = 'index')
print(unknowns)
                           0
policy_state               0
insured_sex                0
insured_education_level    0
insured_occupation         0
insured_hobbies            0
insured_relationship       0
incident_type              0
collision_type           178
incident_severity          0
authorities_contacted      0
incident_state             0
incident_city              0
property_damage          360
police_report_available  343
auto_make                  0
auto_model                 0
incident_state_count       0
csl_per_person             0
csl_per_accident           0
incident_period_of_day     0

collision_type, property_damage, police_report_available contain many missing values. So, first isolate these variables, inspect these individually for spread of category values.

In [48]:
df.collision_type.value_counts()
Out[48]:
Rear Collision     292
Side Collision     276
Front Collision    254
?                  178
Name: collision_type, dtype: int64
In [49]:
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(10,6))
ax= df.groupby('collision_type').police_report_available.count().plot.bar(ylim=0)
ax.set_ylabel('Police Report')
ax.set_xlabel('Collision Type')
ax.set_xticklabels(ax.get_xticklabels(), rotation=10, ha="right")
plt.show()
In [50]:
df.property_damage.value_counts()
Out[50]:
?      360
NO     338
YES    302
Name: property_damage, dtype: int64
In [51]:
plt.style.use('fivethirtyeight')
fig = plt.figure(figsize=(10,6))
ax= df.groupby('property_damage').police_report_available.count().plot.bar(ylim=0)
ax.set_ylabel('Police Report')
ax.set_xlabel('Property Damage')
ax.set_xticklabels(ax.get_xticklabels(), rotation=10, ha="right")
plt.show()
In [52]:
df.police_report_available.value_counts()
Out[52]:
?      343
NO     343
YES    314
Name: police_report_available, dtype: int64
In [53]:
df.columns
Out[53]:
Index(['months_as_customer', 'age', 'policy_state', 'policy_deductable',
       'policy_annual_premium', 'umbrella_limit', 'insured_sex',
       'insured_education_level', 'insured_occupation', 'insured_hobbies',
       'insured_relationship', 'capital-gains', 'capital-loss',
       'incident_type', 'collision_type', 'incident_severity',
       'authorities_contacted', 'incident_state', 'incident_city',
       'number_of_vehicles_involved', 'property_damage', 'bodily_injuries',
       'witnesses', 'police_report_available', 'total_claim_amount',
       'injury_claim', 'property_claim', 'vehicle_claim', 'auto_make',
       'auto_model', 'fraud_reported', 'incident_state_count',
       'csl_per_person', 'csl_per_accident', 'vehicle_age',
       'incident_period_of_day'],
      dtype='object')
In [54]:
df._get_numeric_data().head()  # Checking numeric columns
Out[54]:
months_as_customer age policy_deductable policy_annual_premium umbrella_limit capital-gains capital-loss number_of_vehicles_involved bodily_injuries witnesses total_claim_amount injury_claim property_claim vehicle_claim fraud_reported vehicle_age
0 328 48 1000 1406.91 0 53300 0 1 1 2 71610 6510 13020 52080 1 14
1 228 42 2000 1197.22 5000000 0 0 1 0 0 5070 780 780 3510 1 11
2 134 29 2000 1413.14 5000000 35100 0 3 2 3 34650 7700 3850 23100 0 11
3 256 41 2000 1415.74 6000000 48900 -62400 1 1 2 63400 6340 6340 50720 1 4
4 228 44 1000 1583.91 6000000 66000 -46000 1 0 1 6500 1300 650 4550 0 9
In [55]:
df._get_numeric_data().columns
Out[55]:
Index(['months_as_customer', 'age', 'policy_deductable',
       'policy_annual_premium', 'umbrella_limit', 'capital-gains',
       'capital-loss', 'number_of_vehicles_involved', 'bodily_injuries',
       'witnesses', 'total_claim_amount', 'injury_claim', 'property_claim',
       'vehicle_claim', 'fraud_reported', 'vehicle_age'],
      dtype='object')
In [56]:
df.select_dtypes(include=['object']).columns  # checking categorcial columns
Out[56]:
Index(['policy_state', 'insured_sex', 'insured_education_level',
       'insured_occupation', 'insured_hobbies', 'insured_relationship',
       'incident_type', 'collision_type', 'incident_severity',
       'authorities_contacted', 'incident_state', 'incident_city',
       'property_damage', 'police_report_available', 'auto_make', 'auto_model',
       'incident_state_count', 'csl_per_person', 'csl_per_accident',
       'incident_period_of_day'],
      dtype='object')

Applying one-hot encoding to convert all categorical variables except out target variables

'collision_type', 'property_damage', 'police_report_available', 'fraud_reported'

In [57]:
dummies = pd.get_dummies(df[[
    'policy_state', 
    'insured_sex', 
    'insured_education_level',
    'insured_occupation', 
    'insured_hobbies', 
    'insured_relationship',
    'incident_type', 
    'incident_severity',
    'authorities_contacted', 
    'incident_state', 
    'incident_city',
    'auto_make', 
    'auto_model', 
    'csl_per_person', 
    'csl_per_accident',
    'incident_period_of_day']])

dummies = dummies.join(df[[
    'collision_type', 
    'property_damage', 
    'police_report_available', 
    "fraud_reported"]])

dummies.head()
Out[57]:
policy_state_IL policy_state_IN policy_state_OH insured_sex_FEMALE insured_sex_MALE insured_education_level_Associate insured_education_level_College insured_education_level_High School insured_education_level_JD insured_education_level_MD insured_education_level_Masters insured_education_level_PhD insured_occupation_adm-clerical insured_occupation_armed-forces insured_occupation_craft-repair insured_occupation_exec-managerial insured_occupation_farming-fishing insured_occupation_handlers-cleaners insured_occupation_machine-op-inspct insured_occupation_other-service insured_occupation_priv-house-serv insured_occupation_prof-specialty insured_occupation_protective-serv insured_occupation_sales insured_occupation_tech-support insured_occupation_transport-moving insured_hobbies_base-jumping insured_hobbies_basketball insured_hobbies_board-games insured_hobbies_bungie-jumping insured_hobbies_camping insured_hobbies_chess insured_hobbies_cross-fit insured_hobbies_dancing insured_hobbies_exercise insured_hobbies_golf insured_hobbies_hiking insured_hobbies_kayaking insured_hobbies_movies insured_hobbies_paintball insured_hobbies_polo insured_hobbies_reading insured_hobbies_skydiving insured_hobbies_sleeping insured_hobbies_video-games insured_hobbies_yachting insured_relationship_husband insured_relationship_not-in-family insured_relationship_other-relative insured_relationship_own-child insured_relationship_unmarried insured_relationship_wife incident_type_Multi-vehicle Collision incident_type_Parked Car incident_type_Single Vehicle Collision incident_type_Vehicle Theft incident_severity_Major Damage incident_severity_Minor Damage incident_severity_Total Loss incident_severity_Trivial Damage authorities_contacted_Ambulance authorities_contacted_Fire authorities_contacted_None authorities_contacted_Other authorities_contacted_Police incident_state_NC incident_state_NY incident_state_OH incident_state_PA incident_state_SC incident_state_VA incident_state_WV incident_city_Arlington incident_city_Columbus incident_city_Hillsdale incident_city_Northbend incident_city_Northbrook incident_city_Riverwood incident_city_Springfield auto_make_Accura auto_make_Audi auto_make_BMW auto_make_Chevrolet auto_make_Dodge auto_make_Ford auto_make_Honda auto_make_Jeep auto_make_Mercedes auto_make_Nissan auto_make_Saab auto_make_Suburu auto_make_Toyota auto_make_Volkswagen auto_model_3 Series auto_model_92x auto_model_93 auto_model_95 auto_model_A3 auto_model_A5 auto_model_Accord auto_model_C300 auto_model_CRV auto_model_Camry auto_model_Civic auto_model_Corolla auto_model_E400 auto_model_Escape auto_model_F150 auto_model_Forrestor auto_model_Fusion auto_model_Grand Cherokee auto_model_Highlander auto_model_Impreza auto_model_Jetta auto_model_Legacy auto_model_M5 auto_model_MDX auto_model_ML350 auto_model_Malibu auto_model_Maxima auto_model_Neon auto_model_Passat auto_model_Pathfinder auto_model_RAM auto_model_RSX auto_model_Silverado auto_model_TL auto_model_Tahoe auto_model_Ultima auto_model_Wrangler auto_model_X5 auto_model_X6 csl_per_person_100 csl_per_person_250 csl_per_person_500 csl_per_accident_1000 csl_per_accident_300 csl_per_accident_500 incident_period_of_day_afternoon incident_period_of_day_early_morning incident_period_of_day_evening incident_period_of_day_fore-noon incident_period_of_day_morning incident_period_of_day_night incident_period_of_day_past_midnight collision_type property_damage police_report_available fraud_reported
0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 Side Collision YES YES 1
1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 ? ? ? 1
2 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 Rear Collision NO NO 0
3 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 Front Collision ? NO 1
4 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 ? NO NO 0
In [58]:
X = dummies.iloc[:, 0:-1]  # predictor variables
y = dummies.iloc[:, -1]  # target variable

len(X.columns)
Out[58]:
148
In [59]:
X.head(2)
Out[59]:
policy_state_IL policy_state_IN policy_state_OH insured_sex_FEMALE insured_sex_MALE insured_education_level_Associate insured_education_level_College insured_education_level_High School insured_education_level_JD insured_education_level_MD insured_education_level_Masters insured_education_level_PhD insured_occupation_adm-clerical insured_occupation_armed-forces insured_occupation_craft-repair insured_occupation_exec-managerial insured_occupation_farming-fishing insured_occupation_handlers-cleaners insured_occupation_machine-op-inspct insured_occupation_other-service insured_occupation_priv-house-serv insured_occupation_prof-specialty insured_occupation_protective-serv insured_occupation_sales insured_occupation_tech-support insured_occupation_transport-moving insured_hobbies_base-jumping insured_hobbies_basketball insured_hobbies_board-games insured_hobbies_bungie-jumping insured_hobbies_camping insured_hobbies_chess insured_hobbies_cross-fit insured_hobbies_dancing insured_hobbies_exercise insured_hobbies_golf insured_hobbies_hiking insured_hobbies_kayaking insured_hobbies_movies insured_hobbies_paintball insured_hobbies_polo insured_hobbies_reading insured_hobbies_skydiving insured_hobbies_sleeping insured_hobbies_video-games insured_hobbies_yachting insured_relationship_husband insured_relationship_not-in-family insured_relationship_other-relative insured_relationship_own-child insured_relationship_unmarried insured_relationship_wife incident_type_Multi-vehicle Collision incident_type_Parked Car incident_type_Single Vehicle Collision incident_type_Vehicle Theft incident_severity_Major Damage incident_severity_Minor Damage incident_severity_Total Loss incident_severity_Trivial Damage authorities_contacted_Ambulance authorities_contacted_Fire authorities_contacted_None authorities_contacted_Other authorities_contacted_Police incident_state_NC incident_state_NY incident_state_OH incident_state_PA incident_state_SC incident_state_VA incident_state_WV incident_city_Arlington incident_city_Columbus incident_city_Hillsdale incident_city_Northbend incident_city_Northbrook incident_city_Riverwood incident_city_Springfield auto_make_Accura auto_make_Audi auto_make_BMW auto_make_Chevrolet auto_make_Dodge auto_make_Ford auto_make_Honda auto_make_Jeep auto_make_Mercedes auto_make_Nissan auto_make_Saab auto_make_Suburu auto_make_Toyota auto_make_Volkswagen auto_model_3 Series auto_model_92x auto_model_93 auto_model_95 auto_model_A3 auto_model_A5 auto_model_Accord auto_model_C300 auto_model_CRV auto_model_Camry auto_model_Civic auto_model_Corolla auto_model_E400 auto_model_Escape auto_model_F150 auto_model_Forrestor auto_model_Fusion auto_model_Grand Cherokee auto_model_Highlander auto_model_Impreza auto_model_Jetta auto_model_Legacy auto_model_M5 auto_model_MDX auto_model_ML350 auto_model_Malibu auto_model_Maxima auto_model_Neon auto_model_Passat auto_model_Pathfinder auto_model_RAM auto_model_RSX auto_model_Silverado auto_model_TL auto_model_Tahoe auto_model_Ultima auto_model_Wrangler auto_model_X5 auto_model_X6 csl_per_person_100 csl_per_person_250 csl_per_person_500 csl_per_accident_1000 csl_per_accident_300 csl_per_accident_500 incident_period_of_day_afternoon incident_period_of_day_early_morning incident_period_of_day_evening incident_period_of_day_fore-noon incident_period_of_day_morning incident_period_of_day_night incident_period_of_day_past_midnight collision_type property_damage police_report_available
0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 Side Collision YES YES
1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 ? ? ?
In [60]:
y.head()
Out[60]:
0    1
1    1
2    0
3    1
4    0
Name: fraud_reported, dtype: int64

Label encoding¶

In [61]:
from sklearn.preprocessing import LabelEncoder
X['collision_en'] = LabelEncoder().fit_transform(dummies['collision_type'])
X[['collision_type', 'collision_en']]
Out[61]:
collision_type collision_en
0 Side Collision 3
1 ? 0
2 Rear Collision 2
3 Front Collision 1
4 ? 0
... ... ...
995 Front Collision 1
996 Rear Collision 2
997 Side Collision 3
998 Rear Collision 2
999 ? 0

1000 rows × 2 columns

In [62]:
X['property_damage'].replace(to_replace='YES', value=1, inplace=True)
X['property_damage'].replace(to_replace='NO', value=0, inplace=True)
X['property_damage'].replace(to_replace='?', value=0, inplace=True)
X['police_report_available'].replace(to_replace='YES', value=1, inplace=True)
X['police_report_available'].replace(to_replace='NO', value=0, inplace=True)
X['police_report_available'].replace(to_replace='?', value=0, inplace=True)

X.head(10)
Out[62]:
policy_state_IL policy_state_IN policy_state_OH insured_sex_FEMALE insured_sex_MALE insured_education_level_Associate insured_education_level_College insured_education_level_High School insured_education_level_JD insured_education_level_MD insured_education_level_Masters insured_education_level_PhD insured_occupation_adm-clerical insured_occupation_armed-forces insured_occupation_craft-repair insured_occupation_exec-managerial insured_occupation_farming-fishing insured_occupation_handlers-cleaners insured_occupation_machine-op-inspct insured_occupation_other-service insured_occupation_priv-house-serv insured_occupation_prof-specialty insured_occupation_protective-serv insured_occupation_sales insured_occupation_tech-support insured_occupation_transport-moving insured_hobbies_base-jumping insured_hobbies_basketball insured_hobbies_board-games insured_hobbies_bungie-jumping insured_hobbies_camping insured_hobbies_chess insured_hobbies_cross-fit insured_hobbies_dancing insured_hobbies_exercise insured_hobbies_golf insured_hobbies_hiking insured_hobbies_kayaking insured_hobbies_movies insured_hobbies_paintball insured_hobbies_polo insured_hobbies_reading insured_hobbies_skydiving insured_hobbies_sleeping insured_hobbies_video-games insured_hobbies_yachting insured_relationship_husband insured_relationship_not-in-family insured_relationship_other-relative insured_relationship_own-child insured_relationship_unmarried insured_relationship_wife incident_type_Multi-vehicle Collision incident_type_Parked Car incident_type_Single Vehicle Collision incident_type_Vehicle Theft incident_severity_Major Damage incident_severity_Minor Damage incident_severity_Total Loss incident_severity_Trivial Damage authorities_contacted_Ambulance authorities_contacted_Fire authorities_contacted_None authorities_contacted_Other authorities_contacted_Police incident_state_NC incident_state_NY incident_state_OH incident_state_PA incident_state_SC incident_state_VA incident_state_WV incident_city_Arlington incident_city_Columbus incident_city_Hillsdale incident_city_Northbend incident_city_Northbrook incident_city_Riverwood incident_city_Springfield auto_make_Accura auto_make_Audi auto_make_BMW auto_make_Chevrolet auto_make_Dodge auto_make_Ford auto_make_Honda auto_make_Jeep auto_make_Mercedes auto_make_Nissan auto_make_Saab auto_make_Suburu auto_make_Toyota auto_make_Volkswagen auto_model_3 Series auto_model_92x auto_model_93 auto_model_95 auto_model_A3 auto_model_A5 auto_model_Accord auto_model_C300 auto_model_CRV auto_model_Camry auto_model_Civic auto_model_Corolla auto_model_E400 auto_model_Escape auto_model_F150 auto_model_Forrestor auto_model_Fusion auto_model_Grand Cherokee auto_model_Highlander auto_model_Impreza auto_model_Jetta auto_model_Legacy auto_model_M5 auto_model_MDX auto_model_ML350 auto_model_Malibu auto_model_Maxima auto_model_Neon auto_model_Passat auto_model_Pathfinder auto_model_RAM auto_model_RSX auto_model_Silverado auto_model_TL auto_model_Tahoe auto_model_Ultima auto_model_Wrangler auto_model_X5 auto_model_X6 csl_per_person_100 csl_per_person_250 csl_per_person_500 csl_per_accident_1000 csl_per_accident_300 csl_per_accident_500 incident_period_of_day_afternoon incident_period_of_day_early_morning incident_period_of_day_evening incident_period_of_day_fore-noon incident_period_of_day_morning incident_period_of_day_night incident_period_of_day_past_midnight collision_type property_damage police_report_available collision_en
0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 Side Collision 1 1 3
1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 ? 0 0 0
2 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 Rear Collision 0 0 2
3 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 Front Collision 0 0 1
4 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 1 0 0 0 0 1 0 0 0 0 ? 0 0 0
5 0 0 1 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 Rear Collision 0 0 2
6 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 Front Collision 0 0 1
7 1 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 Front Collision 0 1 1
8 1 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 1 0 Front Collision 0 1 1
9 1 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 1 0 0 0 0 1 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 0 Rear Collision 0 0 2
In [63]:
X = X.drop(columns = ['collision_type'])
X.head(2)
Out[63]:
policy_state_IL policy_state_IN policy_state_OH insured_sex_FEMALE insured_sex_MALE insured_education_level_Associate insured_education_level_College insured_education_level_High School insured_education_level_JD insured_education_level_MD insured_education_level_Masters insured_education_level_PhD insured_occupation_adm-clerical insured_occupation_armed-forces insured_occupation_craft-repair insured_occupation_exec-managerial insured_occupation_farming-fishing insured_occupation_handlers-cleaners insured_occupation_machine-op-inspct insured_occupation_other-service insured_occupation_priv-house-serv insured_occupation_prof-specialty insured_occupation_protective-serv insured_occupation_sales insured_occupation_tech-support insured_occupation_transport-moving insured_hobbies_base-jumping insured_hobbies_basketball insured_hobbies_board-games insured_hobbies_bungie-jumping insured_hobbies_camping insured_hobbies_chess insured_hobbies_cross-fit insured_hobbies_dancing insured_hobbies_exercise insured_hobbies_golf insured_hobbies_hiking insured_hobbies_kayaking insured_hobbies_movies insured_hobbies_paintball insured_hobbies_polo insured_hobbies_reading insured_hobbies_skydiving insured_hobbies_sleeping insured_hobbies_video-games insured_hobbies_yachting insured_relationship_husband insured_relationship_not-in-family insured_relationship_other-relative insured_relationship_own-child insured_relationship_unmarried insured_relationship_wife incident_type_Multi-vehicle Collision incident_type_Parked Car incident_type_Single Vehicle Collision incident_type_Vehicle Theft incident_severity_Major Damage incident_severity_Minor Damage incident_severity_Total Loss incident_severity_Trivial Damage authorities_contacted_Ambulance authorities_contacted_Fire authorities_contacted_None authorities_contacted_Other authorities_contacted_Police incident_state_NC incident_state_NY incident_state_OH incident_state_PA incident_state_SC incident_state_VA incident_state_WV incident_city_Arlington incident_city_Columbus incident_city_Hillsdale incident_city_Northbend incident_city_Northbrook incident_city_Riverwood incident_city_Springfield auto_make_Accura auto_make_Audi auto_make_BMW auto_make_Chevrolet auto_make_Dodge auto_make_Ford auto_make_Honda auto_make_Jeep auto_make_Mercedes auto_make_Nissan auto_make_Saab auto_make_Suburu auto_make_Toyota auto_make_Volkswagen auto_model_3 Series auto_model_92x auto_model_93 auto_model_95 auto_model_A3 auto_model_A5 auto_model_Accord auto_model_C300 auto_model_CRV auto_model_Camry auto_model_Civic auto_model_Corolla auto_model_E400 auto_model_Escape auto_model_F150 auto_model_Forrestor auto_model_Fusion auto_model_Grand Cherokee auto_model_Highlander auto_model_Impreza auto_model_Jetta auto_model_Legacy auto_model_M5 auto_model_MDX auto_model_ML350 auto_model_Malibu auto_model_Maxima auto_model_Neon auto_model_Passat auto_model_Pathfinder auto_model_RAM auto_model_RSX auto_model_Silverado auto_model_TL auto_model_Tahoe auto_model_Ultima auto_model_Wrangler auto_model_X5 auto_model_X6 csl_per_person_100 csl_per_person_250 csl_per_person_500 csl_per_accident_1000 csl_per_accident_300 csl_per_accident_500 incident_period_of_day_afternoon incident_period_of_day_early_morning incident_period_of_day_evening incident_period_of_day_fore-noon incident_period_of_day_morning incident_period_of_day_night incident_period_of_day_past_midnight property_damage police_report_available collision_en
0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 3
1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0
In [64]:
X = pd.concat([X, df._get_numeric_data()], axis=1)  # joining numeric columns
X.head(2)
Out[64]:
policy_state_IL policy_state_IN policy_state_OH insured_sex_FEMALE insured_sex_MALE insured_education_level_Associate insured_education_level_College insured_education_level_High School insured_education_level_JD insured_education_level_MD insured_education_level_Masters insured_education_level_PhD insured_occupation_adm-clerical insured_occupation_armed-forces insured_occupation_craft-repair insured_occupation_exec-managerial insured_occupation_farming-fishing insured_occupation_handlers-cleaners insured_occupation_machine-op-inspct insured_occupation_other-service insured_occupation_priv-house-serv insured_occupation_prof-specialty insured_occupation_protective-serv insured_occupation_sales insured_occupation_tech-support insured_occupation_transport-moving insured_hobbies_base-jumping insured_hobbies_basketball insured_hobbies_board-games insured_hobbies_bungie-jumping insured_hobbies_camping insured_hobbies_chess insured_hobbies_cross-fit insured_hobbies_dancing insured_hobbies_exercise insured_hobbies_golf insured_hobbies_hiking insured_hobbies_kayaking insured_hobbies_movies insured_hobbies_paintball insured_hobbies_polo insured_hobbies_reading insured_hobbies_skydiving insured_hobbies_sleeping insured_hobbies_video-games insured_hobbies_yachting insured_relationship_husband insured_relationship_not-in-family insured_relationship_other-relative insured_relationship_own-child insured_relationship_unmarried insured_relationship_wife incident_type_Multi-vehicle Collision incident_type_Parked Car incident_type_Single Vehicle Collision incident_type_Vehicle Theft incident_severity_Major Damage incident_severity_Minor Damage incident_severity_Total Loss incident_severity_Trivial Damage authorities_contacted_Ambulance authorities_contacted_Fire authorities_contacted_None authorities_contacted_Other authorities_contacted_Police incident_state_NC incident_state_NY incident_state_OH incident_state_PA incident_state_SC incident_state_VA incident_state_WV incident_city_Arlington incident_city_Columbus incident_city_Hillsdale incident_city_Northbend incident_city_Northbrook incident_city_Riverwood incident_city_Springfield auto_make_Accura auto_make_Audi auto_make_BMW auto_make_Chevrolet auto_make_Dodge auto_make_Ford auto_make_Honda auto_make_Jeep auto_make_Mercedes auto_make_Nissan auto_make_Saab auto_make_Suburu auto_make_Toyota auto_make_Volkswagen auto_model_3 Series auto_model_92x auto_model_93 auto_model_95 auto_model_A3 auto_model_A5 auto_model_Accord auto_model_C300 auto_model_CRV auto_model_Camry auto_model_Civic auto_model_Corolla auto_model_E400 auto_model_Escape auto_model_F150 auto_model_Forrestor auto_model_Fusion auto_model_Grand Cherokee auto_model_Highlander auto_model_Impreza auto_model_Jetta auto_model_Legacy auto_model_M5 auto_model_MDX auto_model_ML350 auto_model_Malibu auto_model_Maxima auto_model_Neon auto_model_Passat auto_model_Pathfinder auto_model_RAM auto_model_RSX auto_model_Silverado auto_model_TL auto_model_Tahoe auto_model_Ultima auto_model_Wrangler auto_model_X5 auto_model_X6 csl_per_person_100 csl_per_person_250 csl_per_person_500 csl_per_accident_1000 csl_per_accident_300 csl_per_accident_500 incident_period_of_day_afternoon incident_period_of_day_early_morning incident_period_of_day_evening incident_period_of_day_fore-noon incident_period_of_day_morning incident_period_of_day_night incident_period_of_day_past_midnight property_damage police_report_available collision_en months_as_customer age policy_deductable policy_annual_premium umbrella_limit capital-gains capital-loss number_of_vehicles_involved bodily_injuries witnesses total_claim_amount injury_claim property_claim vehicle_claim fraud_reported vehicle_age
0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 3 328 48 1000 1406.91 0 53300 0 1 1 2 71610 6510 13020 52080 1 14
1 0 1 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0 1 0 0 0 0 0 228 42 2000 1197.22 5000000 0 0 1 0 0 5070 780 780 3510 1 11
In [65]:
X.columns
Out[65]:
Index(['policy_state_IL', 'policy_state_IN', 'policy_state_OH',
       'insured_sex_FEMALE', 'insured_sex_MALE',
       'insured_education_level_Associate', 'insured_education_level_College',
       'insured_education_level_High School', 'insured_education_level_JD',
       'insured_education_level_MD',
       ...
       'capital-loss', 'number_of_vehicles_involved', 'bodily_injuries',
       'witnesses', 'total_claim_amount', 'injury_claim', 'property_claim',
       'vehicle_claim', 'fraud_reported', 'vehicle_age'],
      dtype='object', length=164)
In [66]:
X = X.drop(columns = ['fraud_reported'])  # dropping target variable 'fraud_reported'
X.columns
Out[66]:
Index(['policy_state_IL', 'policy_state_IN', 'policy_state_OH',
       'insured_sex_FEMALE', 'insured_sex_MALE',
       'insured_education_level_Associate', 'insured_education_level_College',
       'insured_education_level_High School', 'insured_education_level_JD',
       'insured_education_level_MD',
       ...
       'capital-gains', 'capital-loss', 'number_of_vehicles_involved',
       'bodily_injuries', 'witnesses', 'total_claim_amount', 'injury_claim',
       'property_claim', 'vehicle_claim', 'vehicle_age'],
      dtype='object', length=163)

We now have a dataset that we could use to evaluate an algorithm sensitive to missing values like LDA.¶

In [67]:
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.model_selection import KFold
from sklearn.model_selection import cross_val_score

# evaluate an LDA model on the dataset using k-fold cross validation
model = LinearDiscriminantAnalysis()
kfold = KFold(n_splits=5, random_state=7)
result = cross_val_score(model, X, y, cv=kfold, scoring='accuracy')
print(result.mean())
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:552: FitFailedWarning:

Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py", line 464, in fit
    self._solve_svd(X, y)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py", line 381, in _solve_svd
    U, S, V = linalg.svd(X, full_matrices=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\scipy\linalg\decomp_svd.py", line 132, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge


nan
In [68]:
print("Accuracy: %0.2f (+/- %0.2f)" % (result.mean(), result.std() * 2))
Accuracy: nan (+/- nan)

84 % cross validation score without standardizing the data. Above is the mean score and the 95% confidence interval of the score estimate. This looks good to go for other Classification methods.

Creating a Training Set for the Data Set

In [69]:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=0.8, random_state=7)
print('length of X_train and X_test: ', len(X_train), len(X_test))
print('length of y_train and y_test: ', len(y_train), len(y_test))
length of X_train and X_test:  800 200
length of y_train and y_test:  800 200

Random Forest Classification¶

In [70]:
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, recall_score, classification_report, cohen_kappa_score
from sklearn import metrics 

# Baseline Random forest based Model
rfc = RandomForestClassifier(n_estimators=200)
    
kfold = KFold(n_splits=5, random_state=7)
result2 = cross_val_score(rfc, X_train, y_train, cv=kfold, scoring='accuracy')
print(result2.mean())
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

0.765
In [71]:
# Generate a Histogram plot for anomaly detection
plt.style.use('fivethirtyeight')
plt.rcParams['figure.figsize'] = [15, 8]
df.plot(kind='hist')
plt.show()
In [72]:
plt.rcParams['figure.figsize'] = [5, 5]
sns.boxplot(x=X.policy_annual_premium)
plt.xlabel('Policy Annual Premium')
plt.show()
In [73]:
plt.rcParams['figure.figsize'] = [5, 5]
sns.boxplot(x=X.witnesses)
plt.xlabel('Witnesses')
plt.show()
In [74]:
plt.rcParams['figure.figsize'] = [5, 5]
sns.boxplot(x=X.vehicle_age)
plt.xlabel('Vehicle Age')
plt.show()
In [75]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler(with_mean=False)
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

X_train_scaled
Out[75]:
array([[0.        , 0.        , 2.08816082, ..., 0.90375361, 1.84093195,
        1.81430422],
       [0.        , 0.        , 2.08816082, ..., 2.10117545, 1.60502305,
        2.14417772],
       [0.        , 2.16950399, 0.        , ..., 2.69678424, 2.74665362,
        0.82468374],
       ...,
       [2.11480423, 0.        , 0.        , ..., 1.26980485, 2.58657258,
        3.7935452 ],
       [0.        , 2.16950399, 0.        , ..., 0.10960856, 0.22327092,
        2.80392471],
       [0.        , 2.16950399, 0.        , ..., 3.29239303, 2.51495738,
        1.15455723]])
In [76]:
X_train_scaled = pd.DataFrame(X_train_scaled, columns = X_train.columns) # retaining columns names
X_train_scaled.head(2)
Out[76]:
policy_state_IL policy_state_IN policy_state_OH insured_sex_FEMALE insured_sex_MALE insured_education_level_Associate insured_education_level_College insured_education_level_High School insured_education_level_JD insured_education_level_MD insured_education_level_Masters insured_education_level_PhD insured_occupation_adm-clerical insured_occupation_armed-forces insured_occupation_craft-repair insured_occupation_exec-managerial insured_occupation_farming-fishing insured_occupation_handlers-cleaners insured_occupation_machine-op-inspct insured_occupation_other-service insured_occupation_priv-house-serv insured_occupation_prof-specialty insured_occupation_protective-serv insured_occupation_sales insured_occupation_tech-support insured_occupation_transport-moving insured_hobbies_base-jumping insured_hobbies_basketball insured_hobbies_board-games insured_hobbies_bungie-jumping insured_hobbies_camping insured_hobbies_chess insured_hobbies_cross-fit insured_hobbies_dancing insured_hobbies_exercise insured_hobbies_golf insured_hobbies_hiking insured_hobbies_kayaking insured_hobbies_movies insured_hobbies_paintball insured_hobbies_polo insured_hobbies_reading insured_hobbies_skydiving insured_hobbies_sleeping insured_hobbies_video-games insured_hobbies_yachting insured_relationship_husband insured_relationship_not-in-family insured_relationship_other-relative insured_relationship_own-child insured_relationship_unmarried insured_relationship_wife incident_type_Multi-vehicle Collision incident_type_Parked Car incident_type_Single Vehicle Collision incident_type_Vehicle Theft incident_severity_Major Damage incident_severity_Minor Damage incident_severity_Total Loss incident_severity_Trivial Damage authorities_contacted_Ambulance authorities_contacted_Fire authorities_contacted_None authorities_contacted_Other authorities_contacted_Police incident_state_NC incident_state_NY incident_state_OH incident_state_PA incident_state_SC incident_state_VA incident_state_WV incident_city_Arlington incident_city_Columbus incident_city_Hillsdale incident_city_Northbend incident_city_Northbrook incident_city_Riverwood incident_city_Springfield auto_make_Accura auto_make_Audi auto_make_BMW auto_make_Chevrolet auto_make_Dodge auto_make_Ford auto_make_Honda auto_make_Jeep auto_make_Mercedes auto_make_Nissan auto_make_Saab auto_make_Suburu auto_make_Toyota auto_make_Volkswagen auto_model_3 Series auto_model_92x auto_model_93 auto_model_95 auto_model_A3 auto_model_A5 auto_model_Accord auto_model_C300 auto_model_CRV auto_model_Camry auto_model_Civic auto_model_Corolla auto_model_E400 auto_model_Escape auto_model_F150 auto_model_Forrestor auto_model_Fusion auto_model_Grand Cherokee auto_model_Highlander auto_model_Impreza auto_model_Jetta auto_model_Legacy auto_model_M5 auto_model_MDX auto_model_ML350 auto_model_Malibu auto_model_Maxima auto_model_Neon auto_model_Passat auto_model_Pathfinder auto_model_RAM auto_model_RSX auto_model_Silverado auto_model_TL auto_model_Tahoe auto_model_Ultima auto_model_Wrangler auto_model_X5 auto_model_X6 csl_per_person_100 csl_per_person_250 csl_per_person_500 csl_per_accident_1000 csl_per_accident_300 csl_per_accident_500 incident_period_of_day_afternoon incident_period_of_day_early_morning incident_period_of_day_evening incident_period_of_day_fore-noon incident_period_of_day_morning incident_period_of_day_night incident_period_of_day_past_midnight property_damage police_report_available collision_en months_as_customer age policy_deductable policy_annual_premium umbrella_limit capital-gains capital-loss number_of_vehicles_involved bodily_injuries witnesses total_claim_amount injury_claim property_claim vehicle_claim vehicle_age
0 0.0 0.0 2.088161 0.0 2.006033 0.0 0.0 2.685971 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.660072 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.639399 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 2.624446 0.0 0.0 0.0 0.000000 0.0 2.044498 0.0 0.0 2.083333 0.0 0.0 2.536374 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 3.262074 0.0 0.00000 0.0 2.860721 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.686049 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 5.441296 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.086534 0.0 0.0 0.0 2.086534 0.0 3.023716 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.793297 0.548299 3.132848 3.280121 4.926916 0.0 0.0000 -2.171862 0.980864 1.229787 1.799726 1.644157 0.901702 0.903754 1.840932 1.814304
1 0.0 0.0 2.088161 0.0 2.006033 0.0 0.0 2.685971 0.0 0.0 0.0 0.0 0.0 0.0 0.0 3.660072 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 4.386345 0.0 0.0 0.0 0.0 0.0 0.0 2.685971 0.0 0.000000 0.0 0.0 0.0 2.026102 0.0 0.000000 0.0 0.0 2.083333 0.0 0.0 2.536374 0.0 0.0 0.0 0.0 3.333333 0.0 0.0 0.0 0.0 0.000000 0.0 2.80056 0.0 0.000000 0.0 0.0 0.0 0.0 3.826087 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 5.181386 0.0 0.0 0.0 0.0 0.0 0.000000 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.086534 0.0 0.0 0.0 2.086534 0.0 3.023716 0.0 0.0 0.0 0.0 0.0 0.0 0.0 2.793297 1.173702 3.348906 3.280121 5.493052 0.0 1.8834 -1.436117 2.942591 1.229787 1.799726 1.911286 2.096404 2.101175 1.605023 2.144178
In [77]:
# Generate a Histogram plot on scaled data to check anomalies
plt.rcParams['figure.figsize'] = [15, 8]
X_train_scaled.plot(kind='hist')
Out[77]:
<AxesSubplot:ylabel='Frequency'>
In [78]:
x_train_scaled = pd.DataFrame.to_numpy(X_train_scaled) # converting to array for computational ease
x_train_scaled
Out[78]:
array([[0.        , 0.        , 2.08816082, ..., 0.90375361, 1.84093195,
        1.81430422],
       [0.        , 0.        , 2.08816082, ..., 2.10117545, 1.60502305,
        2.14417772],
       [0.        , 2.16950399, 0.        , ..., 2.69678424, 2.74665362,
        0.82468374],
       ...,
       [2.11480423, 0.        , 0.        , ..., 1.26980485, 2.58657258,
        3.7935452 ],
       [0.        , 2.16950399, 0.        , ..., 0.10960856, 0.22327092,
        2.80392471],
       [0.        , 2.16950399, 0.        , ..., 3.29239303, 2.51495738,
        1.15455723]])
In [79]:
from sklearn.ensemble import AdaBoostClassifier, VotingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn import model_selection
from sklearn.model_selection import KFold, cross_val_score
from xgboost import XGBClassifier
from sklearn.linear_model import LogisticRegressionCV
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

xgb = XGBClassifier()
logreg= LogisticRegressionCV(solver='lbfgs', cv=10)
knn = KNeighborsClassifier(5)
svcl = SVC()
adb = AdaBoostClassifier()
dt = DecisionTreeClassifier(max_depth=5)
rf = RandomForestClassifier()
lda = LinearDiscriminantAnalysis()
gnb = GaussianNB()

# prepare configuration for cross validation test harness
seed = 7
# prepare models
models = []
models.append(('LR', LogisticRegressionCV(solver='lbfgs', max_iter=5000, cv=10)))
models.append(('XGB', XGBClassifier()))
models.append(('KNN', KNeighborsClassifier()))
models.append(('DT', DecisionTreeClassifier()))
models.append(('SVM', SVC(gamma='auto')))
models.append(('RF', RandomForestClassifier(n_estimators=200)))
models.append(('ADA', AdaBoostClassifier(n_estimators=200)))
models.append(('LDA', LinearDiscriminantAnalysis()))
models.append(('GNB', GaussianNB()))
              
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, x_train_scaled, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)

# boxplot algorithm comparison
plt.rcParams['figure.figsize'] = [15, 8]              
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)
plt.boxplot(results)
ax.set_xticklabels(names)
plt.show()
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

LR: 0.826250 (0.034664)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

XGB: 0.828750 (0.023083)
KNN: 0.735000 (0.055283)
DT: 0.795000 (0.029686)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

SVM: 0.780000 (0.038810)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

RF: 0.797500 (0.047697)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

ADA: 0.797500 (0.036142)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py:552: FitFailedWarning:

Estimator fit failed. The score on this train-test partition for these parameters will be set to nan. Details: 
Traceback (most recent call last):
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_validation.py", line 531, in _fit_and_score
    estimator.fit(X_train, y_train, **fit_params)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py", line 464, in fit
    self._solve_svd(X, y)
  File "C:\ProgramData\Anaconda3\lib\site-packages\sklearn\discriminant_analysis.py", line 381, in _solve_svd
    U, S, V = linalg.svd(X, full_matrices=False)
  File "C:\ProgramData\Anaconda3\lib\site-packages\scipy\linalg\decomp_svd.py", line 132, in svd
    raise LinAlgError("SVD did not converge")
numpy.linalg.LinAlgError: SVD did not converge


LDA: nan (nan)
GNB: 0.618750 (0.076291)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

In [80]:
clf1= LogisticRegressionCV(solver='lbfgs', max_iter=5000, cv=10)
clf2 = XGBClassifier() 

clf = [
    ('LR', clf1), 
    ('XGB', clf2)] 
    
#create our voting classifier, inputting our models
eclf= VotingClassifier(estimators=[
    ('LR', clf1), 
    ('XGB', clf2)], voting='hard')

for clf, label in zip([clf1, clf2, eclf], [
    'Logistic Regression', 
    'XGB Classifier',
    'Ensemble']):
    
    scores = cross_val_score(clf, x_train_scaled, y_train, cv=10, scoring='accuracy')
    print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), label))
Accuracy: 0.83 (+/- 0.03) [Logistic Regression]
Accuracy: 0.83 (+/- 0.02) [XGB Classifier]
Accuracy: 0.82 (+/- 0.03) [Ensemble]
In [81]:
from numpy import sort
from sklearn.feature_selection import SelectFromModel

# fit model on all training data
xgb = XGBClassifier()
xgb.fit(x_train_scaled, y_train)

# make predictions for test data and evaluate
xgb_pred = xgb.predict(X_test_scaled)
predictions = [round(value) for value in xgb_pred]
accuracy = accuracy_score(y_test, xgb_pred)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

# Fit model using each importance as a threshold
thresholds = sort(xgb.feature_importances_)
for thresh in thresholds:
    
    # select features using threshold
    selection = SelectFromModel(xgb, threshold=thresh, prefit=True)
    select_X_train = selection.transform(x_train_scaled)
    
    # train model
    selection_model = XGBClassifier()
    selection_model.fit(select_X_train, y_train)
    
    # eval model
    select_X_test = selection.transform(X_test_scaled)
    xgb_pred = selection_model.predict(select_X_test)
    predictions = [round(value) for value in xgb_pred]
    accuracy = accuracy_score(y_test, xgb_pred)
    print("Thresh=%.3f, n=%d, Accuracy: %.2f%%" % (thresh, select_X_train.shape[1], accuracy*100.0))
Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.000, n=163, Accuracy: 79.50%
Thresh=0.002, n=73, Accuracy: 79.50%
Thresh=0.004, n=72, Accuracy: 79.50%
Thresh=0.004, n=71, Accuracy: 79.50%
Thresh=0.004, n=70, Accuracy: 79.50%
Thresh=0.005, n=69, Accuracy: 80.00%
Thresh=0.005, n=68, Accuracy: 79.50%
Thresh=0.005, n=67, Accuracy: 79.50%
Thresh=0.005, n=66, Accuracy: 80.50%
Thresh=0.005, n=65, Accuracy: 80.50%
Thresh=0.006, n=64, Accuracy: 80.00%
Thresh=0.006, n=63, Accuracy: 80.00%
Thresh=0.006, n=62, Accuracy: 81.00%
Thresh=0.006, n=61, Accuracy: 79.50%
Thresh=0.006, n=60, Accuracy: 79.50%
Thresh=0.007, n=59, Accuracy: 80.50%
Thresh=0.007, n=58, Accuracy: 80.00%
Thresh=0.007, n=57, Accuracy: 80.00%
Thresh=0.007, n=56, Accuracy: 80.50%
Thresh=0.007, n=55, Accuracy: 79.00%
Thresh=0.007, n=54, Accuracy: 79.00%
Thresh=0.007, n=53, Accuracy: 81.00%
Thresh=0.007, n=52, Accuracy: 79.50%
Thresh=0.008, n=51, Accuracy: 80.50%
Thresh=0.008, n=50, Accuracy: 80.50%
Thresh=0.008, n=49, Accuracy: 80.50%
Thresh=0.009, n=48, Accuracy: 82.00%
Thresh=0.009, n=47, Accuracy: 82.00%
Thresh=0.009, n=46, Accuracy: 82.00%
Thresh=0.010, n=45, Accuracy: 80.50%
Thresh=0.010, n=44, Accuracy: 80.50%
Thresh=0.010, n=43, Accuracy: 81.00%
Thresh=0.010, n=42, Accuracy: 82.00%
Thresh=0.010, n=41, Accuracy: 82.50%
Thresh=0.010, n=40, Accuracy: 80.00%
Thresh=0.010, n=39, Accuracy: 80.00%
Thresh=0.010, n=38, Accuracy: 80.00%
Thresh=0.010, n=37, Accuracy: 82.00%
Thresh=0.011, n=36, Accuracy: 81.00%
Thresh=0.011, n=35, Accuracy: 80.00%
Thresh=0.011, n=34, Accuracy: 78.00%
Thresh=0.011, n=33, Accuracy: 78.00%
Thresh=0.011, n=32, Accuracy: 79.50%
Thresh=0.011, n=31, Accuracy: 78.00%
Thresh=0.011, n=30, Accuracy: 78.00%
Thresh=0.011, n=29, Accuracy: 80.50%
Thresh=0.012, n=28, Accuracy: 81.50%
Thresh=0.012, n=27, Accuracy: 81.50%
Thresh=0.012, n=26, Accuracy: 82.00%
Thresh=0.012, n=25, Accuracy: 80.50%
Thresh=0.012, n=24, Accuracy: 79.50%
Thresh=0.013, n=23, Accuracy: 79.00%
Thresh=0.013, n=22, Accuracy: 79.00%
Thresh=0.013, n=21, Accuracy: 80.00%
Thresh=0.013, n=20, Accuracy: 79.50%
Thresh=0.013, n=19, Accuracy: 81.00%
Thresh=0.013, n=18, Accuracy: 80.00%
Thresh=0.013, n=17, Accuracy: 81.00%
Thresh=0.014, n=16, Accuracy: 80.00%
Thresh=0.014, n=15, Accuracy: 79.50%
Thresh=0.015, n=14, Accuracy: 81.50%
Thresh=0.016, n=13, Accuracy: 81.00%
Thresh=0.016, n=12, Accuracy: 82.00%
Thresh=0.017, n=11, Accuracy: 81.50%
Thresh=0.018, n=10, Accuracy: 82.00%
Thresh=0.018, n=9, Accuracy: 82.00%
Thresh=0.018, n=8, Accuracy: 83.00%
Thresh=0.018, n=7, Accuracy: 82.00%
Thresh=0.018, n=6, Accuracy: 83.50%
Thresh=0.019, n=5, Accuracy: 82.50%
Thresh=0.024, n=4, Accuracy: 82.50%
Thresh=0.057, n=3, Accuracy: 83.00%
Thresh=0.079, n=2, Accuracy: 81.00%
Thresh=0.136, n=1, Accuracy: 79.00%
In [82]:
from xgboost import plot_importance
x = XGBClassifier()
x.fit(X_train_scaled, y_train) # fitting the model again on dataframe to identify the feature names

plt.rcParams['figure.figsize'] = [25, 20]
# plot feature importance
plot_importance(x);
In [83]:
from pprint import pprint
# Check parameters used 
print('Parameters currently in use:\n')
pprint(x.get_params())
Parameters currently in use:

{'base_score': 0.5,
 'booster': 'gbtree',
 'colsample_bylevel': 1,
 'colsample_bynode': 1,
 'colsample_bytree': 1,
 'gamma': 0,
 'learning_rate': 0.1,
 'max_delta_step': 0,
 'max_depth': 3,
 'min_child_weight': 1,
 'missing': None,
 'n_estimators': 100,
 'n_jobs': 1,
 'nthread': None,
 'objective': 'binary:logistic',
 'random_state': 0,
 'reg_alpha': 0,
 'reg_lambda': 1,
 'scale_pos_weight': 1,
 'seed': None,
 'silent': None,
 'subsample': 1,
 'verbosity': 1}
In [84]:
from sklearn.model_selection import StratifiedKFold
from sklearn.model_selection import GridSearchCV
import matplotlib
matplotlib.use('Agg')
from matplotlib import pyplot
plt.rcParams['figure.figsize'] = [10, 6]

# grid search
max_depth = range(1, 11, 2)
print(max_depth)

param_grid = dict(max_depth=max_depth)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(xgb, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1,  iid=False)
grid_result = grid_search.fit(x_train_scaled, y_train)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))


# plot
pyplot.errorbar(max_depth, means, yerr=stds)
pyplot.title("XGBoost max_depth vs Log Loss")
pyplot.xlabel('max_depth')
pyplot.ylabel('Log Loss')
range(1, 11, 2)
Fitting 10 folds for each of 5 candidates, totalling 50 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:   14.7s
[Parallel(n_jobs=-1)]: Done  50 out of  50 | elapsed:   18.1s finished
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:849: FutureWarning:

The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.

Best: -0.368020 using {'max_depth': 1}
-0.368020 (0.058795) with: {'max_depth': 1}
-0.383396 (0.068880) with: {'max_depth': 3}
-0.417851 (0.102162) with: {'max_depth': 5}
-0.436605 (0.118897) with: {'max_depth': 7}
-0.455016 (0.125771) with: {'max_depth': 9}
Out[84]:
Text(0, 0.5, 'Log Loss')
In [85]:
import numpy

n_estimators = [50, 100, 150, 200]
max_depth = [2, 4, 6, 8]
print(max_depth)
param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)
kfold = StratifiedKFold(n_splits=10, shuffle=True, random_state=7)
grid_search = GridSearchCV(xgb, param_grid, scoring="neg_log_loss", n_jobs=-1, cv=kfold, verbose=1, iid=False)
grid_result = grid_search.fit(x_train_scaled, y_train)

# summarize results
print("Best: %f using %s" % (grid_result.best_score_, grid_result.best_params_))
means = grid_result.cv_results_['mean_test_score']
stds = grid_result.cv_results_['std_test_score']
params = grid_result.cv_results_['params']
for mean, stdev, param in zip(means, stds, params):
    print("%f (%f) with: %r" % (mean, stdev, param))

# plot results
scores = numpy.array(means).reshape(len(max_depth), len(n_estimators))
for i, value in enumerate(max_depth):
    pyplot.plot(n_estimators, scores[i], label='depth: ' + str(value))
pyplot.legend()
pyplot.xlabel('n_estimators')
pyplot.ylabel('Log Loss')
[2, 4, 6, 8]
Fitting 10 folds for each of 16 candidates, totalling 160 fits
[Parallel(n_jobs=-1)]: Using backend LokyBackend with 4 concurrent workers.
[Parallel(n_jobs=-1)]: Done  42 tasks      | elapsed:    8.6s
[Parallel(n_jobs=-1)]: Done 160 out of 160 | elapsed:  1.1min finished
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_search.py:849: FutureWarning:

The parameter 'iid' is deprecated in 0.22 and will be removed in 0.24.

Best: -0.357098 using {'max_depth': 2, 'n_estimators': 50}
-0.357098 (0.056692) with: {'max_depth': 2, 'n_estimators': 50}
-0.373042 (0.068481) with: {'max_depth': 2, 'n_estimators': 100}
-0.385426 (0.074841) with: {'max_depth': 2, 'n_estimators': 150}
-0.392391 (0.078139) with: {'max_depth': 2, 'n_estimators': 200}
-0.372726 (0.073106) with: {'max_depth': 4, 'n_estimators': 50}
-0.410393 (0.094496) with: {'max_depth': 4, 'n_estimators': 100}
-0.437988 (0.113429) with: {'max_depth': 4, 'n_estimators': 150}
-0.466359 (0.127664) with: {'max_depth': 4, 'n_estimators': 200}
-0.386799 (0.090680) with: {'max_depth': 6, 'n_estimators': 50}
-0.439307 (0.126254) with: {'max_depth': 6, 'n_estimators': 100}
-0.480956 (0.142816) with: {'max_depth': 6, 'n_estimators': 150}
-0.510842 (0.159092) with: {'max_depth': 6, 'n_estimators': 200}
-0.395697 (0.101834) with: {'max_depth': 8, 'n_estimators': 50}
-0.449124 (0.133029) with: {'max_depth': 8, 'n_estimators': 100}
-0.483625 (0.150359) with: {'max_depth': 8, 'n_estimators': 150}
-0.502735 (0.159152) with: {'max_depth': 8, 'n_estimators': 200}
Out[85]:
Text(0, 0.5, 'Log Loss')
In [86]:
xgb = XGBClassifier(objective='binary:logistic', random_state=7, n_jobs=-1)
xgb.fit(x_train_scaled, y_train)
scores = cross_val_score(xgb, x_train_scaled, y_train, cv=kfold, scoring='brier_score_loss')
print('Brier loss:', "{0:.5f}".format(np.mean(scores)*-1))
Brier loss: 0.11885
In [87]:
print(xgb.get_params())
{'base_score': 0.5, 'booster': 'gbtree', 'colsample_bylevel': 1, 'colsample_bynode': 1, 'colsample_bytree': 1, 'gamma': 0, 'learning_rate': 0.1, 'max_delta_step': 0, 'max_depth': 3, 'min_child_weight': 1, 'missing': None, 'n_estimators': 100, 'n_jobs': -1, 'nthread': None, 'objective': 'binary:logistic', 'random_state': 7, 'reg_alpha': 0, 'reg_lambda': 1, 'scale_pos_weight': 1, 'seed': None, 'silent': None, 'subsample': 1, 'verbosity': 1}
In [88]:
from sklearn.model_selection import RandomizedSearchCV

# Create the parameter grid
params = {
    'learning_rate': [0.0001, 0.001, 0.01, 0.1, 0.2, 0.3],
    'n_estimators': [int(x) for x in np.linspace(start=100, stop=500, num=9)],
    'max_depth': [i for i in range(3, 10)],
    'min_child_weight': [i for i in range(1, 7)],
    'subsample': [i/10.0 for i in range(6,11)],
    'colsample_bytree': [i/10.0 for i in range(6,11)]
}
 
# Create the randomised grid search model
# "n_iter = number of parameter settings that are sampled. n_iter trades off runtime vs quality of the solution"
rgs = RandomizedSearchCV(estimator=xgb, param_distributions=params, n_iter=200, cv=kfold, 
                         random_state=7, n_jobs=-1,
                         scoring='brier_score_loss', return_train_score=True)
# Fit rgs
rgs.fit(x_train_scaled, y_train)
 
# Print results
print(rgs)
RandomizedSearchCV(cv=StratifiedKFold(n_splits=10, random_state=7, shuffle=True),
                   estimator=XGBClassifier(n_jobs=-1, random_state=7),
                   n_iter=200, n_jobs=-1,
                   param_distributions={'colsample_bytree': [0.6, 0.7, 0.8, 0.9,
                                                             1.0],
                                        'learning_rate': [0.0001, 0.001, 0.01,
                                                          0.1, 0.2, 0.3],
                                        'max_depth': [3, 4, 5, 6, 7, 8, 9],
                                        'min_child_weight': [1, 2, 3, 4, 5, 6],
                                        'n_estimators': [100, 150, 200, 250,
                                                         300, 350, 400, 450,
                                                         500],
                                        'subsample': [0.6, 0.7, 0.8, 0.9, 1.0]},
                   random_state=7, return_train_score=True,
                   scoring='brier_score_loss')
In [89]:
best_score = rgs.best_score_
best_params = rgs.best_params_
print("Best score: {}".format(best_score))
print("Best params: ")
for param_name in sorted(best_params.keys()):
    print('%s: %r' % (param_name, best_params[param_name]))
Best score: -0.10631241913018777
Best params: 
colsample_bytree: 1.0
learning_rate: 0.01
max_depth: 3
min_child_weight: 3
n_estimators: 200
subsample: 1.0
In [90]:
# make predictions for test data and evaluate
rgs_pred = rgs.predict(X_test_scaled)

print('Accuracy: ', round(accuracy_score(y_test, rgs_pred)*100, 2))
print( 'Cohen Kappa: '+ str(np.round(cohen_kappa_score(y_test, rgs_pred),3)))
print('Recall: ', round(recall_score(y_test, rgs_pred)*100, 2))
print('\n Classification Report:\n', classification_report(y_test, rgs_pred))

print(result.mean())
Accuracy:  82.0
Cohen Kappa: 0.58
Recall:  84.31

 Classification Report:
               precision    recall  f1-score   support

           0       0.94      0.81      0.87       149
           1       0.61      0.84      0.70        51

    accuracy                           0.82       200
   macro avg       0.77      0.83      0.79       200
weighted avg       0.85      0.82      0.83       200

nan
In [91]:
xgb = XGBClassifier()

# prepare configuration for cross validation test harness
seed = 7
# prepare models
models = []
models.append(('XGB', XGBClassifier()))
              
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, x_train_scaled, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

XGB: 0.828750 (0.023083)
In [92]:
# Fit rgs
model.fit(x_train_scaled, y_train)

# make predictions for test data
y_pred = model.predict(X_test_scaled)
predictions = [round(value) for value in y_pred]

# evaluate predictions
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))
Accuracy: 79.50%
In [93]:
from sklearn.metrics import average_precision_score
average_precision = average_precision_score(y_test, rgs_pred)

print('Average precision-recall score: {0:0.2f}'.format(
      average_precision))
Average precision-recall score: 0.55
In [94]:
from sklearn.metrics import precision_recall_curve
from inspect import signature

plt.rcParams['figure.figsize'] = [10, 6]

precision, recall, _ = precision_recall_curve(y_test, rgs_pred)

step_kwargs = ({'step': 'post'}
               if 'step' in signature(plt.fill_between).parameters
               else {})
plt.step(recall, precision, color='b', alpha=0.2,where='post')
plt.fill_between(recall, precision, alpha=0.2, color='b', **step_kwargs)
plt.xlabel('Recall')
plt.ylabel('Precision')
plt.ylim([0.0, 1.05])
plt.xlim([0.0, 1.0])
plt.title('Precision-Recall curve: AP={0:0.2f}'.format(average_precision), fontsize=12)
Out[94]:
Text(0.5, 1.0, 'Precision-Recall curve: AP=0.55')
In [95]:
from sklearn.metrics import roc_curve
from sklearn.metrics import roc_auc_score

# calculate AUC
auc = roc_auc_score(y_test, rgs_pred)
print('AUC: %.3f' % auc)

# calculate roc curve
fpr, tpr, thresholds = roc_curve(y_test, rgs_pred)

# plot no skill
plt.rcParams['figure.figsize'] = [10, 6]
plt.plot([0, 1], [0, 1], linestyle='--')

# plot the roc curve for the model
plt.plot(fpr, tpr, marker='.')
AUC: 0.828
Out[95]:
[<matplotlib.lines.Line2D at 0x2066ce84748>]
In [96]:
from sklearn.metrics import confusion_matrix
import itertools

#Evaluation of Model - Confusion Matrix Plot
def plot_confusion_matrix(cm, classes, title ='Confusion matrix', normalize = False, cmap = plt.cm.Blues):
    
    print('Confusion matrix')

    print(cm)
    
    plt.style.use('fivethirtyeight')
    fig = plt.figure(figsize=(10,6))

    plt.imshow(cm, interpolation='nearest', cmap=cmap)
    plt.title(title)
    plt.colorbar()
    tick_marks = np.arange(len(classes))
    plt.xticks(tick_marks, classes, rotation=40)
    plt.yticks(tick_marks, classes)

    fmt = '.2f' if normalize else 'd'
    thresh = cm.max() / 2.
    for i, j in itertools.product(range(cm.shape[0]), range(cm.shape[1])):
        plt.text(j, i, format(cm[i, j], fmt),
                 horizontalalignment="center",
                 color="white" if cm[i, j] > thresh else "black")

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    plt.tight_layout()


# Compute confusion matrix
cnf_matrix = confusion_matrix(y_test, rgs_pred)
np.set_printoptions(precision=2)

# Plot confusion matrix
plt.figure()
plot_confusion_matrix(cnf_matrix, classes=['Fraud_Y','Fraud_N'],
                      title='Confusion matrix')
Confusion matrix
[[121  28]
 [  8  43]]
<Figure size 720x432 with 0 Axes>
In [97]:
from sklearn.feature_selection import VarianceThreshold

constant_filter = VarianceThreshold(threshold=0.057) 
constant_filter.fit(X_train_scaled)  

constant_columns = [column for column in X_train_scaled.columns  
                    if column not in X_train_scaled.columns[constant_filter.get_support()]]

print(len(constant_columns))
0
In [98]:
correlated_features = set()  
correlation_matrix = X_train_scaled.corr()  

for i in range(len(correlation_matrix .columns)):  
    for j in range(i):
        if abs(correlation_matrix.iloc[i, j]) > 0.8:
            colname = correlation_matrix.columns[i]
            correlated_features.add(colname)
            
len(correlated_features)
Out[98]:
10
In [99]:
print(correlated_features)
{'injury_claim', 'insured_sex_MALE', 'csl_per_accident_300', 'csl_per_accident_1000', 'number_of_vehicles_involved', 'age', 'csl_per_accident_500', 'auto_model_Wrangler', 'property_claim', 'vehicle_claim'}
In [100]:
X.head(1)
Out[100]:
policy_state_IL policy_state_IN policy_state_OH insured_sex_FEMALE insured_sex_MALE insured_education_level_Associate insured_education_level_College insured_education_level_High School insured_education_level_JD insured_education_level_MD insured_education_level_Masters insured_education_level_PhD insured_occupation_adm-clerical insured_occupation_armed-forces insured_occupation_craft-repair insured_occupation_exec-managerial insured_occupation_farming-fishing insured_occupation_handlers-cleaners insured_occupation_machine-op-inspct insured_occupation_other-service insured_occupation_priv-house-serv insured_occupation_prof-specialty insured_occupation_protective-serv insured_occupation_sales insured_occupation_tech-support insured_occupation_transport-moving insured_hobbies_base-jumping insured_hobbies_basketball insured_hobbies_board-games insured_hobbies_bungie-jumping insured_hobbies_camping insured_hobbies_chess insured_hobbies_cross-fit insured_hobbies_dancing insured_hobbies_exercise insured_hobbies_golf insured_hobbies_hiking insured_hobbies_kayaking insured_hobbies_movies insured_hobbies_paintball insured_hobbies_polo insured_hobbies_reading insured_hobbies_skydiving insured_hobbies_sleeping insured_hobbies_video-games insured_hobbies_yachting insured_relationship_husband insured_relationship_not-in-family insured_relationship_other-relative insured_relationship_own-child insured_relationship_unmarried insured_relationship_wife incident_type_Multi-vehicle Collision incident_type_Parked Car incident_type_Single Vehicle Collision incident_type_Vehicle Theft incident_severity_Major Damage incident_severity_Minor Damage incident_severity_Total Loss incident_severity_Trivial Damage authorities_contacted_Ambulance authorities_contacted_Fire authorities_contacted_None authorities_contacted_Other authorities_contacted_Police incident_state_NC incident_state_NY incident_state_OH incident_state_PA incident_state_SC incident_state_VA incident_state_WV incident_city_Arlington incident_city_Columbus incident_city_Hillsdale incident_city_Northbend incident_city_Northbrook incident_city_Riverwood incident_city_Springfield auto_make_Accura auto_make_Audi auto_make_BMW auto_make_Chevrolet auto_make_Dodge auto_make_Ford auto_make_Honda auto_make_Jeep auto_make_Mercedes auto_make_Nissan auto_make_Saab auto_make_Suburu auto_make_Toyota auto_make_Volkswagen auto_model_3 Series auto_model_92x auto_model_93 auto_model_95 auto_model_A3 auto_model_A5 auto_model_Accord auto_model_C300 auto_model_CRV auto_model_Camry auto_model_Civic auto_model_Corolla auto_model_E400 auto_model_Escape auto_model_F150 auto_model_Forrestor auto_model_Fusion auto_model_Grand Cherokee auto_model_Highlander auto_model_Impreza auto_model_Jetta auto_model_Legacy auto_model_M5 auto_model_MDX auto_model_ML350 auto_model_Malibu auto_model_Maxima auto_model_Neon auto_model_Passat auto_model_Pathfinder auto_model_RAM auto_model_RSX auto_model_Silverado auto_model_TL auto_model_Tahoe auto_model_Ultima auto_model_Wrangler auto_model_X5 auto_model_X6 csl_per_person_100 csl_per_person_250 csl_per_person_500 csl_per_accident_1000 csl_per_accident_300 csl_per_accident_500 incident_period_of_day_afternoon incident_period_of_day_early_morning incident_period_of_day_evening incident_period_of_day_fore-noon incident_period_of_day_morning incident_period_of_day_night incident_period_of_day_past_midnight property_damage police_report_available collision_en months_as_customer age policy_deductable policy_annual_premium umbrella_limit capital-gains capital-loss number_of_vehicles_involved bodily_injuries witnesses total_claim_amount injury_claim property_claim vehicle_claim vehicle_age
0 0 0 1 0 1 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 1 0 1 0 0 0 0 0 1 1 3 328 48 1000 1406.91 0 53300 0 1 1 2 71610 6510 13020 52080 14
In [101]:
x = X.drop([ 
       'vehicle_claim', 
        'injury_claim', 
        'age', 
        'csl_per_accident_500', 
        'csl_per_accident_1000', 
        'auto_model_Wrangler', 
        'insured_sex_MALE', 
        'csl_per_accident_300', 
        'property_claim', 
        'number_of_vehicles_involved'], axis=1)

x.head(1)
Out[101]:
policy_state_IL policy_state_IN policy_state_OH insured_sex_FEMALE insured_education_level_Associate insured_education_level_College insured_education_level_High School insured_education_level_JD insured_education_level_MD insured_education_level_Masters insured_education_level_PhD insured_occupation_adm-clerical insured_occupation_armed-forces insured_occupation_craft-repair insured_occupation_exec-managerial insured_occupation_farming-fishing insured_occupation_handlers-cleaners insured_occupation_machine-op-inspct insured_occupation_other-service insured_occupation_priv-house-serv insured_occupation_prof-specialty insured_occupation_protective-serv insured_occupation_sales insured_occupation_tech-support insured_occupation_transport-moving insured_hobbies_base-jumping insured_hobbies_basketball insured_hobbies_board-games insured_hobbies_bungie-jumping insured_hobbies_camping insured_hobbies_chess insured_hobbies_cross-fit insured_hobbies_dancing insured_hobbies_exercise insured_hobbies_golf insured_hobbies_hiking insured_hobbies_kayaking insured_hobbies_movies insured_hobbies_paintball insured_hobbies_polo insured_hobbies_reading insured_hobbies_skydiving insured_hobbies_sleeping insured_hobbies_video-games insured_hobbies_yachting insured_relationship_husband insured_relationship_not-in-family insured_relationship_other-relative insured_relationship_own-child insured_relationship_unmarried insured_relationship_wife incident_type_Multi-vehicle Collision incident_type_Parked Car incident_type_Single Vehicle Collision incident_type_Vehicle Theft incident_severity_Major Damage incident_severity_Minor Damage incident_severity_Total Loss incident_severity_Trivial Damage authorities_contacted_Ambulance authorities_contacted_Fire authorities_contacted_None authorities_contacted_Other authorities_contacted_Police incident_state_NC incident_state_NY incident_state_OH incident_state_PA incident_state_SC incident_state_VA incident_state_WV incident_city_Arlington incident_city_Columbus incident_city_Hillsdale incident_city_Northbend incident_city_Northbrook incident_city_Riverwood incident_city_Springfield auto_make_Accura auto_make_Audi auto_make_BMW auto_make_Chevrolet auto_make_Dodge auto_make_Ford auto_make_Honda auto_make_Jeep auto_make_Mercedes auto_make_Nissan auto_make_Saab auto_make_Suburu auto_make_Toyota auto_make_Volkswagen auto_model_3 Series auto_model_92x auto_model_93 auto_model_95 auto_model_A3 auto_model_A5 auto_model_Accord auto_model_C300 auto_model_CRV auto_model_Camry auto_model_Civic auto_model_Corolla auto_model_E400 auto_model_Escape auto_model_F150 auto_model_Forrestor auto_model_Fusion auto_model_Grand Cherokee auto_model_Highlander auto_model_Impreza auto_model_Jetta auto_model_Legacy auto_model_M5 auto_model_MDX auto_model_ML350 auto_model_Malibu auto_model_Maxima auto_model_Neon auto_model_Passat auto_model_Pathfinder auto_model_RAM auto_model_RSX auto_model_Silverado auto_model_TL auto_model_Tahoe auto_model_Ultima auto_model_X5 auto_model_X6 csl_per_person_100 csl_per_person_250 csl_per_person_500 incident_period_of_day_afternoon incident_period_of_day_early_morning incident_period_of_day_evening incident_period_of_day_fore-noon incident_period_of_day_morning incident_period_of_day_night incident_period_of_day_past_midnight property_damage police_report_available collision_en months_as_customer policy_deductable policy_annual_premium umbrella_limit capital-gains capital-loss bodily_injuries witnesses total_claim_amount vehicle_age
0 0 0 1 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 0 0 1 0 1 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 1 0 0 0 0 0 1 1 3 328 1000 1406.91 0 53300 0 1 2 71610 14
In [102]:
x_train, x_test, y_train, y_test = train_test_split(x, y, train_size=0.8, random_state=7)
print('length of X_train and X_test: ', len(x_train), len(x_test))
print('length of y_train and y_test: ', len(y_train), len(y_test))
length of X_train and X_test:  800 200
length of y_train and y_test:  800 200
In [103]:
a_train_scaled = scaler.fit_transform(x_train)
a_test_scaled = scaler.transform(x_test)
In [104]:
xgb = XGBClassifier()
logreg= LogisticRegressionCV(solver='lbfgs', cv=10)

# prepare configuration for cross validation test harness
seed = 7
# prepare models
models = []
models.append(('LR', LogisticRegressionCV(solver='lbfgs', max_iter=5000, cv=10)))
models.append(('XGB', XGBClassifier()))
              
# evaluate each model in turn
results = []
names = []
scoring = 'accuracy'
for name, model in models:
    kfold = model_selection.KFold(n_splits=10, random_state=seed)
    cv_results = model_selection.cross_val_score(model, a_train_scaled, y_train, cv=kfold, scoring=scoring)
    results.append(cv_results)
    names.append(name)
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

LR: 0.822500 (0.037417)
C:\ProgramData\Anaconda3\lib\site-packages\sklearn\model_selection\_split.py:297: FutureWarning:

Setting a random_state has no effect since shuffle is False. This will raise an error in 0.24. You should leave random_state to its default (None), or set shuffle=True.

XGB: 0.820000 (0.033166)
In [ ]: